Microsoft Malware Classification Challenge: Introduction to Top-Ranked Methods (in Kaggle Study Meetup)

Microsoft Malware Classification Challenge
Introduction to Top-Ranked Methods
Shotaro Sano

Agenda
 Competition overview
 Baseline approach
   Word counts & random forest
 Top-ranked methods
   Feature extraction
   Feature transformation
   Classifiers
 The winning team's model

Competition overview
 Task: malware classification
 Input: hex dump and disassembly files
   Hex dump (.bytes)
   Disassembly (.asm)

Competition overview
[Figure: .bytes and .asm files -> Your Model -> Malware Class Probabilities]
 10,868 training samples
 1,000 GB in total
 9 classes
 10,873 test samples

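Competition overview
 Output a probability for every class for each sample
 Model evaluation: Log Loss

$$L_{\log} = -\frac{1}{N} \sum_{i=0}^{N-1} \sum_{k=0}^{K-1} y_{i,k} \log p_{i,k}$$

 N: number of samples, K: number of classes
 y_{i,k}: 1 if sample i belongs to class k, otherwise 0
 p_{i,k}: predicted probability that sample i belongs to class k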
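
The Log Loss metric can be sketched in a few lines. Because y is one-hot, the inner sum reduces to the log of the probability assigned to the true class. The function name and the clipping constant `eps` are assumptions for illustration; clipping is only there to avoid log(0).

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip to avoid log(0); eps is an assumed constant.
        p = min(max(row[label], eps), 1.0 - eps)
        total += math.log(p)
    return -total / len(y_true)

# Three samples, three classes (toy values).
y_true = [0, 2, 1]
probs = [
    [0.8, 0.1, 0.1],
    [0.2, 0.2, 0.6],
    [0.3, 0.4, 0.3],
]
loss = multiclass_log_loss(y_true, probs)
```

A uniform prediction over the 9 classes would score log(9), which gives a rough sense of scale for the leaderboard numbers.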

Baseline approach

 Beat the benchmark (~0.182) with RandomForest [4]
 Extract word-count features from the hex dump
   1 byte = 1 word
 Feed the counts directly into a random forest
[Figure: .bytes -> 1-byte Word Count -> Random Forest Classifier -> Malware Class Probabilities]

 Appeared on the forum early in the competition
 It was a surprise that one can achieve the accuracy of 0.96 just by using counts of ‘00’, ‘FF’, and ‘??’. [3]
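
A minimal sketch of the 1-byte word count, assuming each .bytes line starts with an address token followed by space-separated two-character hex bytes, with `??` as a placeholder token. The names `VOCAB` and `byte_counts` are hypothetical. The resulting 257-dimensional vector is what would be handed to a random forest, which is not shown here.

```python
from collections import Counter

# All 256 byte values plus the "??" placeholder token.
VOCAB = [f"{b:02X}" for b in range(256)] + ["??"]

def byte_counts(lines):
    """Count 1-byte words, skipping the leading address token on each line."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        counts.update(tok.upper() for tok in tokens[1:])
    return [counts[word] for word in VOCAB]

# Toy input in the assumed .bytes layout.
sample = [
    "00401000 00 FF 8B 44 24 04 ?? ??",
    "00401008 00 00 C3 ff",
]
features = byte_counts(sample)
```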

Feature extraction by top-ranked teams
 Word counts from the hex dump
 Word counts from the disassembly
 Hybrid word counts
 File metadata
 Texture images

Word counts from the hex dump
 Treat 1 byte as 1 word
 N-gram models improve performance
   The winning team's model uses up to 4-grams
 Another approach treats 1 line as 1 word [2]
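
Byte N-grams can be counted by sliding a window over the token stream. The sketch below is illustrative only: `byte_ngrams` is a hypothetical helper, and real hex dumps would need a streaming or hashed variant because the number of distinct 4-grams is very large.

```python
from collections import Counter

def byte_ngrams(tokens, n):
    """Count contiguous n-grams of byte tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

tokens = "8B 44 24 04 8B 44 24 08".split()
# 1-grams through 4-grams, matching the "up to 4-grams" setting.
counts = {n: byte_ngrams(tokens, n) for n in range(1, 5)}
```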

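Word counts from the disassembly
[Figure: a .asm line split into header, hex dump code, and assembly code]
 Count the instructions
 Count the segment names
 Turn DLL function import information into features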
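
A rough sketch of the segment and instruction counts, assuming .asm lines of the form `segment:address`, then uppercase hex bytes, then a lowercase mnemonic. That layout and the helper name `asm_counts` are assumptions. The first non-byte token is taken as the mnemonic, so data directives such as `dd` are counted too.

```python
import re
from collections import Counter

# Byte columns are assumed to be uppercase two-character hex tokens.
HEX_BYTE = re.compile(r"^[0-9A-F]{2}$")

def asm_counts(lines):
    """Return (segment_counts, mnemonic_counts) for an iterable of .asm lines."""
    segments, mnemonics = Counter(), Counter()
    for line in lines:
        header, _, rest = line.partition(" ")
        if ":" not in header:
            continue
        segments[header.split(":", 1)[0]] += 1
        code = rest.split(";", 1)[0].split()  # drop trailing comments
        mnemonic = next((t for t in code if not HEX_BYTE.match(t)), None)
        if mnemonic is not None:
            mnemonics[mnemonic] += 1
    return segments, mnemonics

sample = [
    ".text:00401000 56                push    esi",
    ".text:00401001 8B 44 24 08       mov     eax, [esp+8]",
    ".text:00401005 C3                retn",
    ".data:00402000 00 00 00 00       dd 0   ; padding",
]
segments, mnemonics = asm_counts(sample)
```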

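Hybrid features
 DAF (Derived Assembly Features) [6]
(1) Extract N-gram features from the hex dump
(2) Narrow down (1) by information gain
(3) Extract the assembly instructions that co-occur with (2)
(4) Narrow down (3) by information gain
 Instructions are counted as features only where the hex dump feature is important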

File metadata
 Size of the hex dump file
 Size of the disassembly file
 Compression ratio of the hex dump file
 Compression ratio of the disassembly file
 etc.
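
Size and compression ratio are cheap to compute. The sketch below uses `zlib` as the compressor, which is an assumption: the slides do not say which compressor was used. The helper name is hypothetical.

```python
import zlib

def size_and_compression_ratio(data: bytes):
    """Return (raw size in bytes, compressed size / raw size)."""
    raw = len(data)
    if raw == 0:
        return 0, 0.0
    return raw, len(zlib.compress(data)) / raw

# Highly repetitive content compresses well, so the ratio is small.
repetitive = b"00 00 00 00 " * 1000
size, ratio = size_and_compression_ratio(repetitive)
```

In practice `data` would be the contents of a .bytes or .asm file read in binary mode.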

Texture images
 Convert the hex dump into a grayscale image
   1 byte = 1 pixel value
 Extract suitable image features
   The original paper uses GIST features [7]
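
Turning bytes into a grayscale image amounts to reshaping the byte string into fixed-width rows. The width of 16 and the helper name are assumptions chosen for illustration. Image feature extraction such as GIST is not shown.

```python
def bytes_to_image(data: bytes, width: int = 16):
    """Reshape a byte string into rows of pixel intensities (0-255)."""
    usable = len(data) - len(data) % width  # drop the ragged tail
    return [list(data[i:i + width]) for i in range(0, usable, width)]

# 40 bytes at width 16 give 2 full rows; the last 8 bytes are dropped.
image = bytes_to_image(bytes(range(40)), width=16)
```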

Feature transformation by top-ranked teams
 TF-IDF
 Information gain
 Non-negative matrix factorization
 Random forest

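TF-IDF
 Normalize word frequency by document length
 Emphasize words that appear in only a few documents

$$\mathrm{tfidf} = \mathrm{tf} \cdot \mathrm{idf}$$

$$\mathrm{tf}_{word,doc} = \frac{\#\{word \text{ in } doc\}}{\sum_{word'} \#\{word' \text{ in } doc\}}$$

$$\mathrm{idf}_{word} = \log \frac{\#\{\text{all docs}\}}{\#\{\text{docs including } word\}}$$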
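
The TF-IDF definition above maps directly to code. This is a minimal sketch with a hypothetical `tfidf` helper. Here a "document" is one malware sample and a "word" is a byte or instruction token.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per doc."""
    # Number of documents that contain each word at least once.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    n_docs = len(docs)
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

docs = [
    ["mov", "push", "mov", "call"],
    ["mov", "xor"],
    ["mov", "push"],
]
weights = tfidf(docs)
```

A word that occurs in every document (`mov` here) gets weight 0, while a word unique to one document (`xor`) is emphasized.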

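Information gain
 The reduction in entropy when a feature is known
 Simplifications
   Word frequency => binary value indicating whether the word appears
   Features are selected independently for each class

$$\mathrm{Gain}(x) = H(Y) - H(Y \mid x)$$

$$\mathrm{Gain}(x) = -\left(\frac{p}{t}\log_2\frac{p}{t} + \frac{n}{t}\log_2\frac{n}{t}\right) + \sum_{v \in \{0,1\}} \frac{t_v}{t}\left(\frac{p_v}{t_v}\log_2\frac{p_v}{t_v} + \frac{n_v}{t_v}\log_2\frac{n_v}{t_v}\right)$$

 p: number of positive samples, n: number of negative samples
 t: total number of samples
 p_v, n_v, t_v: the same counts among samples where the feature is fixed to value v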
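
A small sketch of the information gain for one binary feature against one class (one-vs-rest). The function names are hypothetical, and zero counts are treated as contributing zero entropy.

```python
import math

def entropy(pos, neg):
    """Binary entropy (bits) of a set with pos positive and neg negative samples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            frac = count / total
            result -= frac * math.log2(frac)
    return result

def information_gain(present, labels):
    """present[i]: 1 if the word occurs in sample i; labels[i]: 1 for the target class."""
    t = len(labels)
    p = sum(labels)
    gain = entropy(p, t - p)  # H(Y)
    for v in (0, 1):
        subset = [y for x, y in zip(present, labels) if x == v]
        if subset:
            p_v = sum(subset)
            # Subtract the weighted conditional entropy for feature value v.
            gain -= len(subset) / t * entropy(p_v, len(subset) - p_v)
    return gain

labels = [1, 1, 0, 0]
gain_perfect = information_gain([1, 1, 0, 0], labels)  # feature mirrors the class
gain_useless = information_gain([1, 0, 1, 0], labels)  # feature independent of class
```

Features would then be ranked by gain per class and only the top ones kept.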

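Non-negative matrix factorization
 N-gram word counts form a high-dimensional non-negative matrix
 Reduce dimensionality while preserving non-negativity
   Factorize a non-negative matrix into a product of non-negative matrices
   The example below compresses 5 dimensions into 2

$$\begin{pmatrix} 2 & 3 & 6 & 4 & 1 \\ 0 & 0 & 0 & 0 & 0 \\ 4 & 1 & 2 & 3 & 2 \\ 0 & 1 & 2 & 1 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 3 \\ 0 & 0 \\ 2 & 1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 2 & 0 & 0 & 1 & 1 \\ 0 & 1 & 2 & 1 & 0 \end{pmatrix}$$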
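
The numbers in the example can be checked directly: multiplying the 4x2 and 2x5 factors reproduces the 4x5 count matrix, and every entry stays non-negative. The names `W`, `H`, and `V`, and the reading of rows as samples, are assumptions. This only verifies a given factorization; it does not compute one.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

W = [[1, 3], [0, 0], [2, 1], [0, 1]]        # 4 rows x 2 components
H = [[2, 0, 0, 1, 1], [0, 1, 2, 1, 0]]      # 2 components x 5 original features
V = [[2, 3, 6, 4, 1], [0, 0, 0, 0, 0], [4, 1, 2, 3, 2], [0, 1, 2, 1, 0]]

reconstructed = matmul(W, H)
```

Each row of `W` is the compressed 2-dimensional representation of the corresponding 5-dimensional row of `V`.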

Random forest
 Used as a feature selection method rather than as a classifier
 After training, discard features with low feature importance

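XGBoost
 A fast, feature-rich implementation of gradient boosting
 Tree ensemble learning + gradient descent
   Learn weak trees sequentially in the manner of gradient descent

$$F_t(x) = F_{t-1}(x) - \gamma_t \sum_{i=1}^{n} \nabla_{F} L\big(y_i, F_{t-1}(x_i)\big)$$

 F_{t-1}: the forest learned up to the previous step
 \gamma_t: step size
 The tree at the next step is fit to the negative gradient from the previous step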
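
The "fit the next tree to the negative gradient" idea can be illustrated with a toy booster. This sketch is plain first-order gradient boosting with one-split regression stumps and squared loss on 1-D inputs, not XGBoost's actual algorithm. All names and the learning rate are assumptions, and at least two distinct input values are required.

```python
def fit_stump(xs, targets):
    """Best single-split regression stump on 1-D inputs (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((t - lv) ** 2 for t in left) + sum((t - rv) ** 2 for t in right)
        if best is None or err < best[0]:
            best = (err, threshold, lv, rv)
    _, threshold, lv, rv = best
    return lambda x: lv if x <= threshold else rv

def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Gradient boosting for L = (y - F)^2 / 2, whose negative gradient is y - F."""
    trees = []
    predict = lambda x: sum(learning_rate * tree(x) for tree in trees)
    losses = []
    for _ in range(rounds):
        # Negative gradient of the loss at the current ensemble F_{t-1}.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        losses.append(sum(r * r for r in residuals) / (2 * len(xs)))
        trees.append(fit_stump(xs, residuals))
    return predict, losses

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.0, 1.9, 2.1, 4.0, 4.2]
model, losses = boost(xs, ys)
```

`losses[k]` is the training loss before the k-th stump is added, so the list shows the loss shrinking as the ensemble grows.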

Averaging
 Submit the simple average of multiple models' outputs
   The geometric mean sometimes performs better
 Averaging multiple different green lines should bring us closer to the black line. [5]
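
Both kinds of averaging operate element-wise on the class probabilities of one sample. In this sketch the geometric mean is renormalized so the classes sum to 1; that step and the function names are assumptions rather than something stated in the slides.

```python
import math

def arithmetic_mean(predictions):
    """Average class probabilities across models, element-wise."""
    return [sum(col) / len(col) for col in zip(*predictions)]

def geometric_mean(predictions):
    """Element-wise geometric mean, renormalized so the classes sum to 1."""
    raw = [math.prod(col) ** (1.0 / len(col)) for col in zip(*predictions)]
    total = sum(raw)
    return [value / total for value in raw]

# Three models' probabilities for one sample over three classes (toy values).
model_outputs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.9, 0.05, 0.05],
]
blend_arith = arithmetic_mean(model_outputs)
blend_geo = geometric_mean(model_outputs)
```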

Stacking
 Train a model that combines the outputs of multiple models
[Figure: stacking diagram with boxes labeled XGBoost (x3), Neural Network, Nearest Neighbors, Random Forest, Extra Tree, and Averaging]

Summary
 Feature extraction: word-count-based features, file metadata, texture images
 Balancing the number of features with information gain or matrix factorization
 XGBoost
 Averaging or Stacking

[Figure: model diagram. Feature blocks: Opcode 2-gram, Opcode 3-gram, Opcode 4-gram, Header 1-gram, Hexdump 4-gram & Info Gain, DAF 1-gram, DLL 1-gram, Assembly Texture Image, Instruction 1-gram, Hexdump 1-gram. Model blocks: Random Forest (x2), XGBoost, Semi-supervised Learning with Test Dataset, Averaging.]

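Texture images from the disassembly
 Convert the byte sequence of the disassembly file into a texture
 Use the pixel values of the first 1,000 pixels as features
[Figure: hex dump texture and disassembly texture]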

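Semi-supervised learning including the test data
 Train a model on the training data (intermediate model)
 Label all test data with the intermediate model
 Split the labeled test data into multiple chunks
 For each chunk:
   Train the final model on the training data plus the labeled test data outside the target chunk
   Predict the class probabilities of the target chunk with the final model
 Merge the results from all chunks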
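
The chunked pseudo-labeling procedure can be sketched with a toy stand-in model. The nearest-centroid classifier on 1-D inputs, the interleaved chunk split, and all function names below are assumptions; the slides do not specify how chunks are formed.

```python
import math

def fit_centroids(xs, labels):
    """Toy stand-in model: one mean per class on 1-D inputs."""
    sums, counts = {}, {}
    for x, y in zip(xs, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_proba(centroids, x):
    """Softmax over negative distances to each class centroid."""
    scores = {y: math.exp(-abs(x - c)) for y, c in centroids.items()}
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

def predict_label(centroids, x):
    probs = predict_proba(centroids, x)
    return max(probs, key=probs.get)

def semi_supervised_predict(train_x, train_y, test_x, n_chunks=2):
    # Intermediate model from the training data only.
    intermediate = fit_centroids(train_x, train_y)
    # Pseudo-label every test sample.
    pseudo = [predict_label(intermediate, x) for x in test_x]
    results = [None] * len(test_x)
    for c in range(n_chunks):
        chunk = [i for i in range(len(test_x)) if i % n_chunks == c]
        others = [i for i in range(len(test_x)) if i % n_chunks != c]
        # Final model: training data plus pseudo-labeled test data outside the chunk.
        final = fit_centroids(
            train_x + [test_x[i] for i in others],
            train_y + [pseudo[i] for i in others],
        )
        for i in chunk:
            results[i] = predict_proba(final, test_x[i])
    return results  # merged back into the original test order

train_x, train_y = [0.0, 0.2, 5.0, 5.3], ["a", "a", "b", "b"]
test_x = [0.1, 4.8, 0.4, 5.5]
probs = semi_supervised_predict(train_x, train_y, test_x)
```

Holding the target chunk out of the final training set means no sample is predicted by a model that was trained on that sample's own pseudo-label.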

[Figure: model diagram with a Golden Features label. Feature blocks: Opcode 2-gram, Opcode 3-gram, Opcode 4-gram, Segment 1-gram, Hexdump 4-gram & Info Gain, DAF 1-gram, DLL 1-gram, Assembly Texture Image, Instruction 1-gram, Hexdump 1-gram. Model blocks: Random Forest (x2), XGBoost, Semi-supervised Learning with Test Dataset, Averaging.]

Which features were effective?
[Figure: bar chart of log-loss (axis from 0 to 0.016) under Cross Validation, Public Leaderboard, and Private Leaderboard for the feature sets Opcode Count; Opcode Count + Segment Count; Opcode Count + Segment Count + ASM Texture; All Features]

[Figures: Private Leaderboard and Public Leaderboard]

References
1. First place code and documents
   https://www.kaggle.com/c/malware-classification/forums/t/13897/first-place-code-and-documents
2. 2nd place code and documentation
   classification/forums/t/13863/2nd-place-code-and-documentation
3. 3rd place code and documentation
   classification/forums/t/14065/3rd-place-code-and-documentation
4. Beat the benchmark (~0.182) with RandomForest
   https://www.kaggle.com/c/malware-classification/forums/t/12490/beat-the-benchmark-0-182-with-randomforest
5. Kaggle Ensembling Guide
   http://mlwave.com/kaggle-ensembling-guide
6. Masud, M. M., Khan, L., and Thuraisingham, B., “A Scalable Multi-level Feature Extraction Technique to Detect Malicious Executables,” Information Systems Frontiers, Vol. 10, No. 1, pp. 33-45 (2008).
7. Nataraj, L., Yegneswaran, V., Porras, P., and Zhang, J., “A Comparative Assessment of Malware Classification Using Binary Texture Analysis and Dynamic Analysis,” Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, pp. 21-30 (2011).
