SlideShare ist ein Scribd-Unternehmen logo
1 von 43
ember
an open source
malware classifier and
dataset
whoami
Phil Roth
Data Scientist
@mrphilroth
proth@endgame.com
Learned ML at IceCube
Applying it at Endgame
whoami
Hyrum Anderson
Technical Director of Data Science
@drhyrum
Open datasets push ML research
forward
source: https://twitter.com/benhamner/status/938123380074610688
Datasets cited in NIPS papers over time
One example: MNIST
MNIST: http://yann.lecun.com/exdb/mnist/
Database of 70k (60k/10k
training/test split) images of
handwritten digits
“MNIST is the new unit test” –Ian
Goodfellow
Even when the dataset can no
longer effectively measure
performance improvements, it’s
still useful as a sanity check.
Another example: CIFAR 10/100
CIFAR-10:
Database of 60k (50k/10k training/test
split) images of 10 different classes
CIFAR-100:
60k images of 100 different classes
CIFAR: https://www.cs.toronto.edu/~kriz/cifar.html
Security lacks these datasets
2014 Corporate Blog
2015 RSA FloorTalk
Reasons security lacks these
datasets
Personally identifiable information
Communicating vulnerabilities to attackers
Intellectual property
Existing Security Datasets
http://www.secrepo.com/Mike Sconzo’s
DGA Detection
Domain generation algorithms create large numbers of domain names to serve as
rendezvous for C&C servers.
Datasets available:
AlexaTop 1 Million: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
DGA Archive: https://dgarchive.caad.fkie.fraunhofer.de/
DGA Domains: http://osint.bambenekconsulting.com/feeds/dga-feed.txt
Johannes Bacher's reversing: https://github.com/baderj/domain_generation_algorithms
Network Intrusion Detection
Unsupervised learning problem looking for anomalous network events. (To me, this
turns into an alert ordering problem)
Datasets available:
DARPA Datasets:
https://www.ll.mit.edu//ideval/data/1998data.html
https://www.ll.mit.edu//ideval/data/1999data.html
KDD Cup 1999:
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
OLD!!!!
Static Classification of Malware
Basically the antivirus problem solved with machine learning.
Datasets available:
Drebin [Android]: https://www.sec.cs.tu-bs.de/~danarp/drebin/
VirusShare [Malicious Only]: https://virusshare.com/
Microsoft Malware Challenge [Malicious Only. Headers Stripped]:
https://www.kaggle.com/c/malware-classification
Static Classification of Malware
Benign and malicious samples can
be distributed in a feature space
(using attributes like file size and
number of imports)
Goal is to predict samples that we
haven’t seen yet
Static Classification of Malware
AYARA rule can divide these two
classes. But a simple rule won’t be
generalizable.
Static Classification of Malware
A machine learning model can
define a better boundary that
makes more accurate predictions
There are so many options for
machine learning algorithms. How
do we know which one is best?
Endgame Malware BEnchmark for Research
“MNIST for malware”
ember
“I know... But, if I tried to avoid
the name of every Javascript
framework, there wouldn’t be
any names left.”
Endgame Malware BEnchmark for Research
An open source collection of 1.1 million PE File sha256 hashes that were
scanned by VirusTotal sometime in 2017.
The dataset includes metadata, derived features from the PE files, a model
trained on those features, and accompanying code.
It does NOT include the files themselves.
ember
The dataset is divided into a 900k training set and a
200k testing set
Training set includes 300k of benign, malicious, and
unlabeled samples
data
Training set data appears
chronologically prior to the test data
Date metadata allows:
• Chronological cross validation
• Quantifying model performance
degradation over time
train test
data
7 JSON line files containing extracted features
data
[proth@proth-mbp data]$ ls -lh ember_dataset.tar.bz2
-rw-r--r-- 1 proth staff 1.6G Apr 5 11:38 ember_dataset.tar.bz2
[proth@proth-mbp data]$ cd ember
[proth@proth-mbp ember]$ ls -lh
total 9.2G
-rw-r--r-- 1 proth staff 1.6G Apr 6 16:03 test_features.jsonl
-rw-r--r-- 1 proth staff 426M Apr 6 16:03 train_features_0.jsonl
-rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_1.jsonl
-rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_2.jsonl
-rw-r--r-- 1 proth staff 1.4G Apr 6 16:03 train_features_3.jsonl
-rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_4.jsonl
-rw-r--r-- 1 proth staff 1.4G Apr 6 16:03 train_features_5.jsonl
First three keys of each line is metadata
data
[proth@proth-mbp ember]$ head -n 1 train_features_0.jsonl | jq "." | head -n 4
{
"sha256": "0abb4fda7d5b13801d63bee53e5e256be43e141faa077a6d149874242c3f02c2",
"appeared": "2006-12",
"label": 0,
The rest of the keys are feature categories
data
[proth@proth-mbp ember]$ head -n 1 train_features_0.jsonl | jq "del(.sha256,
.appeared, .label)" | jq "keys"
[
"byteentropy",
"exports",
"general",
"header",
"histogram",
"imports",
"section",
"strings"
]
features
Two kinds of features:
Calculated from raw bytes
Calculated from lief parsing
the PE file format
https://lief.quarkslab.com/
https://lief.quarkslab.com/doc/Intro.html
https://github.com/lief-project/LIEF
features
Raw features are calculated from
the bytes and the lief object
Vectorized features are calculated
from the raw features
features
• Byte Histogram (histogram)
A simple counting of how many times each byte occurs
• Byte Entropy Histogram (byteentropy)
Sliding window entropy calculation
Details in Section 2.1.1: [Saxe, Berlin 2015] https://arxiv.org/pdf/1508.03096.pdf
features
• Section Information (section)
Entry section and a list of all sections with name, size, entropy, and other information given
given for each
features
• Import Information (imports)
Each library imported from along with imported function names
• Export Information (exports)
Exported function names
features
• String Information (strings)
Number of strings, average length, character histogram, number of strings that
match various patterns like URLs, MZ header, or registry keys
features
• General Information (general)
Number of imports, exports, symbols and whether the file has relocations,
resources, or a signature
features
• Header Information (header)
Details about the machine the file was compiled on. Versions of linkers, images,
and operating system. etc…
vectorization
After downloading the dataset, feature vectorization is a necessary
step before model training
The ember codebase defines how each feature is hashed into a
vector using scikit-learn tools (FeatureHasher function)
Feature vectorizing took 20 hours on my 2015 MacBook Pro i7
model
Gradient Boosted DecisionTree model trained with
LightGBM on labeled samples
Model training took 3 hours on my 2015 MacBook
Pro i7
import lightgbm as lgb
X_train, y_train = read_vectorized_features(data_dir, subset="train”)
train_rows = (y_train != -1)
lgbm_dataset = lgb.Dataset(X_train[train_rows], y_train[train_rows])
lgbm_model = lgb.train({"application": "binary"}, lgbm_dataset)
model
Ember Model Performance:
ROC AUC: 0.9991123269999999
Threshold: 0.871
False Positive Rate: 0.099%
False Negative Rate: 7.009%
Detection Rate: 92.991%
disclaimer
This model is NOT MalwareScore
MalwareScore:
is better optimized
has better features
performs better
is constantly updated with new data
is the best option for protecting your endpoints (in my totally biased opinion)
code
https://github.com/endgameinc/ember
The ember repo makes
it easy to:
• Vectorize features
• Train the model
• Make predictions on
new PE files
notebook
The Jupyter notebook will
reproduce the graphics from
this talk from the extracted
dataset
suggestions
To beat the benchmark model performance:
Use feature selection techniques to eliminate misleading features
Do feature engineering to find better features
Optimize LightGBM model parameters with grid search
Incorporate information from unlabeled samples into training
suggestions
To further research in the field of ML for static malware
detection:
Quantify model performance degradation through time
Build and compare the performance of featureless neural network
based models (need independent access to samples)
An adversarial network could create or modify PE files to bypass
ember model classification
demo time!
ember
Highlight: “Evidently, despite increased model size and computational
burden, featureless deep learning models have yet to eclipse the
performance of models that leverage domain knowledge via parsed
features.”
Read the paper:
https://arxiv.org/abs/1804.04637
ember
Download the data:
https://pubdata.endgame.com/ember/ember_dataset.tar.bz2
Download the code:
https://github.com/endgameinc/ember
THANKYOU!
Phil Roth: @mrphilroth Hyrum Anderson: @drhyrum

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...
 
Unit testing of spark applications
Unit testing of spark applicationsUnit testing of spark applications
Unit testing of spark applications
 
RedisConf18 - Introducing RediSearch Aggregations
RedisConf18 - Introducing RediSearch AggregationsRedisConf18 - Introducing RediSearch Aggregations
RedisConf18 - Introducing RediSearch Aggregations
 
ClickHouse北京Meetup ClickHouse Best Practice @Sina
ClickHouse北京Meetup ClickHouse Best Practice @SinaClickHouse北京Meetup ClickHouse Best Practice @Sina
ClickHouse北京Meetup ClickHouse Best Practice @Sina
 
Recent Progress on Object Detection_20170331
Recent Progress on Object Detection_20170331Recent Progress on Object Detection_20170331
Recent Progress on Object Detection_20170331
 
Dropout as a Bayesian Approximation
Dropout as a Bayesian ApproximationDropout as a Bayesian Approximation
Dropout as a Bayesian Approximation
 
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
 
The Performance Engineer's Guide To HotSpot Just-in-Time Compilation
The Performance Engineer's Guide To HotSpot Just-in-Time CompilationThe Performance Engineer's Guide To HotSpot Just-in-Time Compilation
The Performance Engineer's Guide To HotSpot Just-in-Time Compilation
 
Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com
 
[Outdated] Secrets of Performance Tuning Java on Kubernetes
[Outdated] Secrets of Performance Tuning Java on Kubernetes[Outdated] Secrets of Performance Tuning Java on Kubernetes
[Outdated] Secrets of Performance Tuning Java on Kubernetes
 
האלים ביוון העתיקה
האלים ביוון העתיקההאלים ביוון העתיקה
האלים ביוון העתיקה
 
Generative Adversarial Networks
Generative Adversarial NetworksGenerative Adversarial Networks
Generative Adversarial Networks
 
Jonathan Ronen - Variational Autoencoders tutorial
Jonathan Ronen - Variational Autoencoders tutorialJonathan Ronen - Variational Autoencoders tutorial
Jonathan Ronen - Variational Autoencoders tutorial
 
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Scalable Monitoring Using Apache Spark and Friends with Utkarsh BhatnagarScalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar
 
Uncertainty in Deep Learning
Uncertainty in Deep LearningUncertainty in Deep Learning
Uncertainty in Deep Learning
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySparkGetting The Best Performance With PySpark
Getting The Best Performance With PySpark
 
XGBoost @ Fyber
XGBoost @ FyberXGBoost @ Fyber
XGBoost @ Fyber
 
Malware classification using Machine Learning
Malware classification using Machine LearningMalware classification using Machine Learning
Malware classification using Machine Learning
 
Apache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage ServiceApache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage Service
 
Building a Social Network with MongoDB
Building a Social Network with MongoDBBuilding a Social Network with MongoDB
Building a Social Network with MongoDB
 

Ähnlich wie Ember

Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Sease
 
Monitoring AI with AI
Monitoring AI with AIMonitoring AI with AI
Monitoring AI with AI
Stepan Pushkarev
 

Ähnlich wie Ember (20)

PythonとAutoML at PyConJP 2019
PythonとAutoML at PyConJP 2019PythonとAutoML at PyConJP 2019
PythonとAutoML at PyConJP 2019
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
 
MLBox
MLBoxMLBox
MLBox
 
MLBox 0.8.2
MLBox 0.8.2 MLBox 0.8.2
MLBox 0.8.2
 
Machine Learning Model Bakeoff
Machine Learning Model BakeoffMachine Learning Model Bakeoff
Machine Learning Model Bakeoff
 
Machine learning key to your formulation challenges
Machine learning key to your formulation challengesMachine learning key to your formulation challenges
Machine learning key to your formulation challenges
 
OpenML 2019
OpenML 2019OpenML 2019
OpenML 2019
 
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
Hadoop cluster performance profiler
Hadoop cluster performance profilerHadoop cluster performance profiler
Hadoop cluster performance profiler
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science Meetup
 
Spark MLlib - Training Material
Spark MLlib - Training Material Spark MLlib - Training Material
Spark MLlib - Training Material
 
Deploying Machine Learning Models to Production
Deploying Machine Learning Models to ProductionDeploying Machine Learning Models to Production
Deploying Machine Learning Models to Production
 
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
 
Towards a Unified Data Analytics Optimizer with Yanlei Diao
Towards a Unified Data Analytics Optimizer with Yanlei DiaoTowards a Unified Data Analytics Optimizer with Yanlei Diao
Towards a Unified Data Analytics Optimizer with Yanlei Diao
 
CSL0777-L07.pptx
CSL0777-L07.pptxCSL0777-L07.pptx
CSL0777-L07.pptx
 
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
 
Monitoring AI with AI
Monitoring AI with AIMonitoring AI with AI
Monitoring AI with AI
 
[AWS Innovate 온라인 컨퍼런스] 간단한 Python 코드만으로 높은 성능의 기계 학습 모델 만들기 - 김무현, AWS Sr.데이...
[AWS Innovate 온라인 컨퍼런스] 간단한 Python 코드만으로 높은 성능의 기계 학습 모델 만들기 - 김무현, AWS Sr.데이...[AWS Innovate 온라인 컨퍼런스] 간단한 Python 코드만으로 높은 성능의 기계 학습 모델 만들기 - 김무현, AWS Sr.데이...
[AWS Innovate 온라인 컨퍼런스] 간단한 Python 코드만으로 높은 성능의 기계 학습 모델 만들기 - 김무현, AWS Sr.데이...
 
Automate Machine Learning Pipeline Using MLBox
Automate Machine Learning Pipeline Using MLBoxAutomate Machine Learning Pipeline Using MLBox
Automate Machine Learning Pipeline Using MLBox
 

Kürzlich hochgeladen

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Kürzlich hochgeladen (20)

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 

Ember

  • 1. ember an open source malware classifier and dataset
  • 3. whoami Hyrum Anderson Technical Director of Data Science @drhyrum
  • 4. Open datasets push ML research forward source: https://twitter.com/benhamner/status/938123380074610688 Datasets cited in NIPS papers over time
  • 5. One example: MNIST MNIST: http://yann.lecun.com/exdb/mnist/ Database of 70k (60k/10k training/test split) images of handwritten digits “MNIST is the new unit test” –Ian Goodfellow Even when the dataset can no longer effectively measure performance improvements, it’s still useful as a sanity check.
  • 6. Another example: CIFAR 10/100 CIFAR-10: Database of 60k (50k/10k training/test split) images of 10 different classes CIFAR-100: 60k images of 100 different classes CIFAR: https://www.cs.toronto.edu/~kriz/cifar.html
  • 7. Security lacks these datasets 2014 Corporate Blog 2015 RSA FloorTalk
  • 8. Reasons security lacks these datasets Personally identifiable information Communicating vulnerabilities to attackers Intellectual property
  • 10. DGA Detection Domain generation algorithms create large numbers of domain names to serve as rendezvous for C&C servers. Datasets available: AlexaTop 1 Million: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip DGA Archive: https://dgarchive.caad.fkie.fraunhofer.de/ DGA Domains: http://osint.bambenekconsulting.com/feeds/dga-feed.txt Johannes Bacher's reversing: https://github.com/baderj/domain_generation_algorithms
  • 11. Network Intrusion Detection Unsupervised learning problem looking for anomalous network events. (To me, this turns into an alert ordering problem) Datasets available: DARPA Datasets: https://www.ll.mit.edu//ideval/data/1998data.html https://www.ll.mit.edu//ideval/data/1999data.html KDD Cup 1999: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html OLD!!!!
  • 12. Static Classification of Malware Basically the antivirus problem solved with machine learning. Datasets available: Drebin [Android]: https://www.sec.cs.tu-bs.de/~danarp/drebin/ VirusShare [Malicious Only]: https://virusshare.com/ Microsoft Malware Challenge [Malicious Only. Headers Stripped]: https://www.kaggle.com/c/malware-classification
  • 13. Static Classification of Malware Benign and malicious samples can be distributed in a feature space (using attributes like file size and number of imports) Goal is to predict samples that we haven’t seen yet
  • 14. Static Classification of Malware AYARA rule can divide these two classes. But a simple rule won’t be generalizable.
  • 15. Static Classification of Malware A machine learning model can define a better boundary that makes more accurate predictions There are so many options for machine learning algorithms. How do we know which one is best?
  • 16. Endgame Malware BEnchmark for Research “MNIST for malware” ember
  • 17. “I know... But, if I tried to avoid the name of every Javascript framework, there wouldn’t be any names left.”
  • 18. Endgame Malware BEnchmark for Research An open source collection of 1.1 million PE File sha256 hashes that were scanned by VirusTotal sometime in 2017. The dataset includes metadata, derived features from the PE files, a model trained on those features, and accompanying code. It does NOT include the files themselves. ember
  • 19. The dataset is divided into a 900k training set and a 200k testing set Training set includes 300k of benign, malicious, and unlabeled samples data
  • 20. Training set data appears chronologically prior to the test data Date metadata allows: • Chronological cross validation • Quantifying model performance degradation over time train test data
  • 21. 7 JSON line files containing extracted features data [proth@proth-mbp data]$ ls -lh ember_dataset.tar.bz2 -rw-r--r-- 1 proth staff 1.6G Apr 5 11:38 ember_dataset.tar.bz2 [proth@proth-mbp data]$ cd ember [proth@proth-mbp ember]$ ls -lh total 9.2G -rw-r--r-- 1 proth staff 1.6G Apr 6 16:03 test_features.jsonl -rw-r--r-- 1 proth staff 426M Apr 6 16:03 train_features_0.jsonl -rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_1.jsonl -rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_2.jsonl -rw-r--r-- 1 proth staff 1.4G Apr 6 16:03 train_features_3.jsonl -rw-r--r-- 1 proth staff 1.5G Apr 6 16:03 train_features_4.jsonl -rw-r--r-- 1 proth staff 1.4G Apr 6 16:03 train_features_5.jsonl
  • 22. First three keys of each line is metadata data [proth@proth-mbp ember]$ head -n 1 train_features_0.jsonl | jq "." | head -n 4 { "sha256": "0abb4fda7d5b13801d63bee53e5e256be43e141faa077a6d149874242c3f02c2", "appeared": "2006-12", "label": 0,
  • 23. The rest of the keys are feature categories data [proth@proth-mbp ember]$ head -n 1 train_features_0.jsonl | jq "del(.sha256, .appeared, .label)" | jq "keys" [ "byteentropy", "exports", "general", "header", "histogram", "imports", "section", "strings" ]
  • 24. features Two kinds of features: Calculated from raw bytes Calculated from lief parsing the PE file format https://lief.quarkslab.com/ https://lief.quarkslab.com/doc/Intro.html https://github.com/lief-project/LIEF
  • 25. features Raw features are calculated from the bytes and the lief object Vectorized features are calculated from the raw features
  • 26. features • Byte Histogram (histogram) A simple counting of how many times each byte occurs • Byte Entropy Histogram (byteentropy) Sliding window entropy calculation Details in Section 2.1.1: [Saxe, Berlin 2015] https://arxiv.org/pdf/1508.03096.pdf
  • 27. features • Section Information (section) Entry section and a list of all sections with name, size, entropy, and other information given given for each
  • 28. features • Import Information (imports) Each library imported from along with imported function names • Export Information (exports) Exported function names
  • 29. features • String Information (strings) Number of strings, average length, character histogram, number of strings that match various patterns like URLs, MZ header, or registry keys
  • 30. features • General Information (general) Number of imports, exports, symbols and whether the file has relocations, resources, or a signature
  • 31. features • Header Information (header) Details about the machine the file was compiled on. Versions of linkers, images, and operating system. etc…
  • 32. vectorization After downloading the dataset, feature vectorization is a necessary step before model training The ember codebase defines how each feature is hashed into a vector using scikit-learn tools (FeatureHasher function) Feature vectorizing took 20 hours on my 2015 MacBook Pro i7
  • 33. model Gradient Boosted DecisionTree model trained with LightGBM on labeled samples Model training took 3 hours on my 2015 MacBook Pro i7 import lightgbm as lgb X_train, y_train = read_vectorized_features(data_dir, subset="train”) train_rows = (y_train != -1) lgbm_dataset = lgb.Dataset(X_train[train_rows], y_train[train_rows]) lgbm_model = lgb.train({"application": "binary"}, lgbm_dataset)
  • 34. model Ember Model Performance: ROC AUC: 0.9991123269999999 Threshold: 0.871 False Positive Rate: 0.099% False Negative Rate: 7.009% Detection Rate: 92.991%
  • 35. disclaimer This model is NOT MalwareScore MalwareScore: is better optimized has better features performs better is constantly updated with new data is the best option for protecting your endpoints (in my totally biased opinion)
  • 36. code https://github.com/endgameinc/ember The ember repo makes it easy to: • Vectorize features • Train the model • Make predictions on new PE files
  • 37. notebook The Jupyter notebook will reproduce the graphics from this talk from the extracted dataset
  • 38. suggestions To beat the benchmark model performance: Use feature selection techniques to eliminate misleading features Do feature engineering to find better features Optimize LightGBM model parameters with grid search Incorporate information from unlabeled samples into training
  • 39. suggestions To further research in the field of ML for static malware detection: Quantify model performance degradation through time Build and compare the performance of featureless neural network based models (need independent access to samples) An adversarial network could create or modify PE files to bypass ember model classification
  • 41. ember Highlight: “Evidently, despite increased model size and computational burden, featureless deep learning models have yet to eclipse the performance of models that leverage domain knowledge via parsed features.” Read the paper: https://arxiv.org/abs/1804.04637
  • 42.
  • 43. ember Download the data: https://pubdata.endgame.com/ember/ember_dataset.tar.bz2 Download the code: https://github.com/endgameinc/ember THANKYOU! Phil Roth: @mrphilroth Hyrum Anderson: @drhyrum