This document summarizes the EMBER malware classification benchmark and dataset: an open source collection of metadata and extracted features for 1.1 million PE file sha256 hashes (the files themselves are not included), divided into training and test sets. Features are extracted from the raw bytes and via PE file parsing. A gradient boosted decision tree model trained on the labeled samples achieves over 0.999 ROC AUC on the test set. Code and a Jupyter notebook are available to reproduce the results, and the talk suggests areas for further research.
4. Open datasets push ML research forward
Chart: datasets cited in NIPS papers over time
source: https://twitter.com/benhamner/status/938123380074610688
5. One example: MNIST
MNIST: http://yann.lecun.com/exdb/mnist/
Database of 70k (60k/10k training/test split) images of handwritten digits
“MNIST is the new unit test” –Ian Goodfellow
Even when the dataset can no longer effectively measure performance improvements, it’s still useful as a sanity check.
6. Another example: CIFAR 10/100
CIFAR-10: Database of 60k (50k/10k training/test split) images of 10 different classes
CIFAR-100: 60k images of 100 different classes
CIFAR: https://www.cs.toronto.edu/~kriz/cifar.html
10. DGA Detection
Domain generation algorithms create large numbers of domain names to serve as rendezvous points for C&C servers (a toy generator is sketched after this list).
Datasets available:
Alexa Top 1 Million: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
DGA Archive: https://dgarchive.caad.fkie.fraunhofer.de/
DGA Domains: http://osint.bambenekconsulting.com/feeds/dga-feed.txt
Johannes Bader's reversing: https://github.com/baderj/domain_generation_algorithms
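To make the idea concrete, here is a tiny, purely illustrative seed-based generator in the spirit of the algorithms reversed in the repository above; it corresponds to no real malware family.

def toy_dga(seed, count=10, tld=".com"):
    # deterministic: anyone who recovers the seed can predict
    # every rendezvous domain the malware will try
    domains = []
    for _ in range(count):
        name = ""
        for _ in range(12):
            seed = (1103515245 * seed + 12345) % (2**31)  # LCG step
            name += chr(ord("a") + seed % 26)
        domains.append(name + tld)
    return domains

print(toy_dga(seed=42))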
11. Network Intrusion Detection
Unsupervised learning problem looking for anomalous network events. (To me, this turns into an alert ordering problem)
Datasets available:
DARPA Datasets:
https://www.ll.mit.edu//ideval/data/1998data.html
https://www.ll.mit.edu//ideval/data/1999data.html
KDD Cup 1999: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
OLD!!!!
12. Static Classification of Malware
Basically the antivirus problem, tackled with machine learning.
Datasets available:
Drebin [Android]: https://www.sec.cs.tu-bs.de/~danarp/drebin/
VirusShare [Malicious Only]: https://virusshare.com/
Microsoft Malware Challenge [Malicious Only. Headers Stripped]: https://www.kaggle.com/c/malware-classification
13. Static Classification of Malware
Benign and malicious samples can be distributed in a feature space (using attributes like file size and number of imports)
Goal is to predict samples that we haven’t seen yet
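As a rough sketch of placing one PE file in such a feature space using LIEF (introduced later in this deck); the two attributes are just the slide's examples, and note that lief.parse returns None for files it cannot parse:

import os
import lief  # https://lief.quarkslab.com/

def simple_features(path):
    # place one PE file in the slide's 2-D feature space:
    # (file size in bytes, total number of imported functions)
    binary = lief.parse(path)
    n_imports = sum(len(lib.entries) for lib in binary.imports)
    return os.path.getsize(path), n_imports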
14. Static Classification of Malware
A YARA rule can divide these two classes. But a simple rule won’t generalize.
15. Static Classification of Malware
A machine learning model can define a better boundary that makes more accurate predictions
There are so many options for machine learning algorithms. How do we know which one is best?
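And a toy model fit over that 2-D space (all numbers and labels below are made up for illustration):

from sklearn.tree import DecisionTreeClassifier

# hypothetical (file size, import count) points with made-up labels
X = [[12288, 4], [30720, 6], [450560, 85], [1048576, 120]]
y = [1, 1, 0, 0]   # 1 = malicious, 0 = benign (toy labels only)

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(clf.predict([[65536, 8]]))   # score a sample we haven't seen yet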
17. “I know... But, if I tried to avoid the name of every Javascript framework, there wouldn’t be any names left.”
18. Endgame Malware BEnchmark for Research (ember)
An open source collection of 1.1 million PE file sha256 hashes that were scanned by VirusTotal sometime in 2017.
The dataset includes metadata, derived features from the PE files, a model trained on those features, and accompanying code.
It does NOT include the files themselves.
19. The dataset is divided into a 900k training set and a 200k testing set
Training set includes 300k each of benign, malicious, and unlabeled samples
20. Training set data appears chronologically prior to the test data
Date metadata allows:
• Chronological cross validation
• Quantifying model performance degradation over time
22. The first three keys of each line are metadata
[proth@proth-mbp ember]$ head -n 1 train_features_0.jsonl | jq "." | head -n 4
{
  "sha256": "0abb4fda7d5b13801d63bee53e5e256be43e141faa077a6d149874242c3f02c2",
  "appeared": "2006-12",
  "label": 0,
23. The rest of the keys are feature categories
[proth@proth-mbp ember]$ head -n 1 train_features_0.jsonl | jq "del(.sha256, .appeared, .label)" | jq "keys"
[
"byteentropy",
"exports",
"general",
"header",
"histogram",
"imports",
"section",
"strings"
]
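The same split is easy to do in Python; a minimal sketch using the file from the listings above:

import json

# read the first record of the first training shard
with open("train_features_0.jsonl") as f:
    sample = json.loads(f.readline())

metadata = {k: sample[k] for k in ("sha256", "appeared", "label")}
features = {k: v for k, v in sample.items() if k not in metadata}
print(metadata)
print(sorted(features))   # byteentropy, exports, general, header, ...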
24. features
Two kinds of features:
Calculated from raw bytes
Calculated from LIEF parsing of the PE file format
https://lief.quarkslab.com/
https://lief.quarkslab.com/doc/Intro.html
https://github.com/lief-project/LIEF
25. features
Raw features are calculated from the bytes and the LIEF object
Vectorized features are calculated from the raw features
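In code, the two steps look roughly like this; the function names come from the ember codebase, though exact signatures may differ between releases, and the path is a placeholder:

import ember

# parse the raw JSONL features and write vectorized arrays to disk
ember.create_vectorized_features("/path/to/ember/")

# later, load the vectorized matrices for training or evaluation
X_train, y_train = ember.read_vectorized_features("/path/to/ember/", subset="train")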
26. features
• Byte Histogram (histogram)
A simple count of how many times each byte value occurs
• Byte Entropy Histogram (byteentropy)
Sliding window entropy calculation
Details in Section 2.1.1: [Saxe, Berlin 2015] https://arxiv.org/pdf/1508.03096.pdf
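A rough sketch of both calculations follows; the entropy-window scheme is one plausible reading of Section 2.1.1 of the cited paper (the window size, step, and binning here are assumptions, and ember's own implementation may differ):

import numpy as np

def byte_histogram(data):
    # count how many times each byte value 0-255 occurs
    return np.bincount(np.frombuffer(data, dtype=np.uint8), minlength=256)

def byte_entropy_histogram(data, window=2048, step=1024):
    # slide a window over the file; for each window, pair its entropy
    # with its (coarse) byte values and accumulate a 16x16 2-D histogram
    arr = np.frombuffer(data, dtype=np.uint8)
    out = np.zeros((16, 16), dtype=np.int64)
    if len(arr) == 0:
        return out.ravel()
    for start in range(0, max(len(arr) - window, 1), step):
        block = arr[start:start + window] >> 4        # 16 coarse byte bins
        counts = np.bincount(block, minlength=16)
        p = counts / counts.sum()
        entropy = -(p[p > 0] * np.log2(p[p > 0])).sum()  # in [0, 4]
        out[min(int(entropy * 4), 15)] += counts      # quantize to 16 bins
    return out.ravel()                                # 256-dim feature vector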
27. features
• Section Information (section)
Entry section, plus name, size, entropy, and other information given for each section
28. features
• Import Information (imports)
Each library imported from, along with the imported function names
• Export Information (exports)
Exported function names
29. features
• String Information (strings)
Number of strings, average length, character histogram, number of strings that match various patterns like URLs, MZ header, or registry keys
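As an illustration, a stripped-down version of this kind of extraction might look like the following (the patterns here are simplified stand-ins, not ember's actual regular expressions):

import re

PRINTABLE = re.compile(rb"[\x20-\x7e]{5,}")  # runs of 5+ printable chars

def string_features(data):
    strings = PRINTABLE.findall(data)
    return {
        "numstrings": len(strings),
        "avlength": sum(map(len, strings)) / max(len(strings), 1),
        "urls": sum(s.lower().startswith((b"http://", b"https://")) for s in strings),
        "MZ": data.count(b"MZ"),                       # embedded PE headers
        "registry": sum(b"HKEY_" in s for s in strings),
    }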
30. features
• General Information (general)
Number of imports, exports, and symbols, and whether the file has relocations, resources, or a signature
31. features
• Header Information (header)
Details about the machine the file was compiled for, versions of the linker, image, and operating system, etc.
32. vectorization
After downloading the dataset, feature vectorization is a necessary step before model training
The ember codebase defines how each feature is hashed into a vector using scikit-learn tools (the FeatureHasher class)
Feature vectorization took 20 hours on my 2015 MacBook Pro i7
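For example, hashing a variable-length list of imported functions into a fixed-size vector with scikit-learn might look like this (the dimension and string format are illustrative, not ember's exact choices):

from sklearn.feature_extraction import FeatureHasher

# one sample: a variable-length list of "library:function" import strings
imports = [["kernel32.dll:CreateFileA", "kernel32.dll:ReadFile", "ws2_32.dll:send"]]

hasher = FeatureHasher(n_features=256, input_type="string")
vector = hasher.transform(imports).toarray()[0]  # fixed 256-dim vector
print(vector.shape)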
33. model
Gradient Boosted Decision Tree model trained with LightGBM on labeled samples
Model training took 3 hours on my 2015 MacBook Pro i7
import lightgbm as lgb
from ember import read_vectorized_features

# load the vectorized features, then drop unlabeled rows (label == -1)
X_train, y_train = read_vectorized_features(data_dir, subset="train")
train_rows = (y_train != -1)
lgbm_dataset = lgb.Dataset(X_train[train_rows], y_train[train_rows])
lgbm_model = lgb.train({"application": "binary"}, lgbm_dataset)
34. model
Ember Model Performance:
ROC AUC: 0.99911
Threshold: 0.871
False Positive Rate: 0.099%
False Negative Rate: 7.009%
Detection Rate: 92.991%
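A minimal sketch of computing metrics like these, reusing lgbm_model from the previous slide and assuming the test split loads the same way (the 0.871 threshold is the slide's number):

from sklearn.metrics import roc_auc_score

# load the held-out test split the same way as the training data
X_test, y_test = read_vectorized_features(data_dir, subset="test")
scores = lgbm_model.predict(X_test)   # probability of being malicious
print("ROC AUC:", roc_auc_score(y_test, scores))

preds = scores > 0.871                # threshold quoted on this slide
fpr = (preds & (y_test == 0)).sum() / (y_test == 0).sum()
fnr = (~preds & (y_test == 1)).sum() / (y_test == 1).sum()
print(f"FPR: {fpr:.3%}  FNR: {fnr:.3%}  Detection rate: {1 - fnr:.3%}")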
35. disclaimer
This model is NOT MalwareScore
MalwareScore:
is better optimized
has better features
performs better
is constantly updated with new data
is the best option for protecting your endpoints (in my totally biased opinion)
38. suggestions
To beat the benchmark model performance:
Use feature selection techniques to eliminate misleading features
Do feature engineering to find better features
Optimize LightGBM model parameters with grid search (see the sketch after this list)
Incorporate information from unlabeled samples into training
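For the grid search suggestion, a minimal sketch using LightGBM's scikit-learn wrapper, reusing X_train, y_train, and train_rows from the model slide (the grid values are guesses at a starting point, not tuned results):

import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

param_grid = {
    "num_leaves": [31, 127, 511],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 400],
}
search = GridSearchCV(
    lgb.LGBMClassifier(objective="binary"),
    param_grid, scoring="roc_auc", cv=3,
)
search.fit(X_train[train_rows], y_train[train_rows])
print(search.best_params_, search.best_score_)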
39. suggestions
To further research in the field of ML for static malware detection:
Quantify model performance degradation through time
Build and compare the performance of featureless neural network based models (need independent access to samples)
An adversarial network could create or modify PE files to bypass ember model classification
41. ember
Highlight: “Evidently, despite increased model size and computational burden, featureless deep learning models have yet to eclipse the performance of models that leverage domain knowledge via parsed features.”
Read the paper: https://arxiv.org/abs/1804.04637