2. Ph.D. Candidate at Deakin University.
Research
◦ Malware detection.
◦ Automated vulnerability discovery (check out my
other talk in the main conference).
Did a Masters by research in malware
◦ “Fast automated unpacking and classification of
malware”.
◦ Presented last year at Ruxcon 2010.
This current work extends last year’s work.
3. Traditional AV works well on known samples.
Doesn’t detect unknown samples.
Doesn’t detect “suspiciously similar” samples.
Uses strings as a signature or “birthmark”.
Compares birthmarks by equality.
4. Birthmarks can be program structure.
More static among malware variants.
Birthmarks can be compared using “approximate
similarity”.
Able to detect unknown samples that are
suspiciously similar to known malware.
Vastly reduces the number of required signatures.
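To make the contrast between equality and approximate similarity concrete, here is a minimal sketch. It assumes a birthmark is a set of features and uses the Jaccard index with a threshold of 0.5; the set representation, the feature names and the threshold are illustrative assumptions, not the birthmark used in this talk.

```python
def exact_match(a, b):
    """Traditional comparison: birthmarks match only if they are identical."""
    return a == b


def jaccard(a, b):
    """Similarity in [0, 1]: shared features divided by all features."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def approx_match(a, b, threshold=0.5):
    """Approximate comparison: match when similarity reaches the threshold.

    The threshold value is an assumption chosen for this example.
    """
    return jaccard(a, b) >= threshold


# Hypothetical birthmarks: a known sample and a variant with one feature changed.
known = {"f1", "f2", "f3", "f4"}
variant = {"f1", "f2", "f3", "f5"}

print(exact_match(known, variant))   # False
print(approx_match(known, variant))  # True
```

The variant shares three of five distinct features, so it fails the equality test but passes the similarity test.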
5. [Figure: the birthmarks of Program p and Program q are compared. Similar? If so: MATCH! Otherwise: Different.]
6. Control flow is more invariant among
polymorphic and metamorphic malware.
A directed graph representing control flow.
A control flow graph for every procedure.
One call graph per program.
7. lea 0x4(%esp),%ecx
and $0xfffffff0,%esp
pushl -0x4(%ecx)
push %ebp
mov %esp,%ebp
push %ecx
sub $0x24,%esp
call 4011b0 <___main>
movl $0x0,-0x8(%ebp)
jmp 40115f <_main+0x2f>
movl $0x4020a0,(%esp)
call 4011b8 <_puts>
addl $0x1,-0x8(%ebp)
cmpl $0x9,-0x8(%ebp)
jle 40114f <_main+0x1f>
add $0x24,%esp
pop %ecx
pop %ebp
lea -0x4(%ecx),%esp
ret
[Figure: call graph with nodes Proc_0, Proc_1, Proc_2, Proc_3 and Proc_4]
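The listing above can be split into basic blocks at its two jump targets (0x40114f and 0x40115f), which gives a small control flow graph. The sketch below writes that graph as an adjacency list; the block names (entry, body, cond, exit) are hypothetical labels chosen for illustration.

```python
# Control flow graph of _main as an adjacency list.
# Block names are made up for this example; the edges follow the jumps
# in the listing (jmp 40115f, jle 40114f) and the fall-through paths.
cfg = {
    "entry": ["cond"],         # prologue, ends with jmp 40115f
    "body": ["cond"],          # 40114f: call _puts, increment, fall through
    "cond": ["body", "exit"],  # 40115f: jle 40114f taken, or fall through
    "exit": [],                # epilogue, ret
}


def edges(graph):
    """Flatten the adjacency list into (source, destination) pairs."""
    return [(src, dst) for src, dsts in graph.items() for dst in dsts]


def has_loop(graph, start):
    """Depth-first search that reports whether a cycle is reachable."""
    on_path, done = set(), set()

    def visit(node):
        on_path.add(node)
        for succ in graph[node]:
            if succ in on_path:
                return True  # back edge found
            if succ not in done and visit(succ):
                return True
        on_path.discard(node)
        done.add(node)
        return False

    return visit(start)


print(edges(cfg))
print(has_loop(cfg, "entry"))  # True: cond -> body -> cond
```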
8. Known as the “Graph Isomorphism” problem.
Identifies equivalent “structure”.
Not known to be NP-complete, but no polynomial
time algorithm known.
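To illustrate why exact matching is costly, a brute-force isomorphism test can be sketched as follows. It tries every relabelling of the nodes, so it needs up to n! attempts; this is an illustrative baseline only, not the algorithm used in this work, and the example graphs are made up.

```python
from itertools import permutations


def isomorphic(edges_a, edges_b, n):
    """Brute-force test for two directed graphs on nodes 0..n-1.

    Tries all n! node relabellings, which is only feasible for tiny graphs.
    """
    a, b = set(edges_a), set(edges_b)
    if len(a) != len(b):
        return False
    for perm in permutations(range(n)):
        if {(perm[u], perm[v]) for u, v in a} == b:
            return True
    return False


# Two drawings of the same loop-shaped graph, with nodes numbered differently.
g1 = [(0, 1), (1, 2), (2, 1), (2, 3)]
g2 = [(3, 0), (0, 2), (2, 0), (2, 1)]
# A four-node cycle, which has a different structure.
g3 = [(0, 1), (1, 2), (2, 3), (3, 0)]

print(isomorphic(g1, g2, 4))  # True
print(isomorphic(g1, g3, 4))  # False
```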
9. The minimum number of basic operations needed to transform
one graph into another (the graph edit distance).
If you know the distance between two
objects, you know the similarity.
Computing this distance is NP-hard and infeasible in practice.
11. Input is a string.
Extract all substrings of fixed size q.
Substrings are known as q-grams.
Let’s take q-grams of the strings from all decompiled graphs.
E.g. with q = 4, the string W|IEH}R gives:
W|IE
|IEH
IEH}
EH}R
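The q-gram extraction described on this slide can be sketched in a few lines. The function name is an assumption; the input string and q = 4 come from the slide's example.

```python
def qgrams(s, q):
    """Return every substring of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("W|IEH}R", 4))  # ['W|IE', '|IEH', 'IEH}', 'EH}R']
```

A string shorter than q yields no q-grams.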
12. An array <E1,...,En>
A feature vector describes the number of
occurrences of each feature.
Ei is the number of times feature i occurs.
Let’s make the 500 most common q-grams
as features.
We use feature vectors as birthmarks.
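A minimal sketch of building such a feature vector follows. It assumes each program is represented by a list of strings, picks the most common q-grams across a corpus as features, and counts them per program. The function names and the tiny corpus are hypothetical, and the example uses 2 features rather than 500 to stay readable.

```python
from collections import Counter


def qgrams(s, q):
    """Return every substring of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def count_qgrams(strings, q):
    """Count q-gram occurrences over all strings of one program or corpus."""
    counts = Counter()
    for s in strings:
        counts.update(qgrams(s, q))
    return counts


def select_features(corpus_strings, q, n_features):
    """Pick the n_features most common q-grams in the corpus."""
    return [g for g, _ in count_qgrams(corpus_strings, q).most_common(n_features)]


def feature_vector(program_strings, q, features):
    """Birthmark: the number of occurrences of each selected feature."""
    counts = count_qgrams(program_strings, q)
    return [counts[f] for f in features]


# Hypothetical corpus of decompiled-graph strings.
corpus = ["W|IEH}R", "W|IEH}", "|IEH}R"]
features = select_features(corpus, 4, 2)
print(features)
print(feature_vector(["W|IEH}R"], 4, features))
```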
13. A vector is an n-dimensional point.
E.g. 2d vector is <x,y>
Distances between vectors are fast to compute.
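As an illustration of treating vectors as points, the sketch below computes two common distances. The slide does not say which distance is used, so both the Manhattan and the Euclidean distance shown here are assumptions chosen as examples.

```python
import math


def manhattan(u, v):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    """Straight-line distance (L2 distance)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


print(manhattan([0, 0], [3, 4]))  # 7
print(euclidean([0, 0], [3, 4]))  # 5.0
```

Both run in time linear in the number of dimensions, which is what makes vector birthmarks cheap to compare.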
14. Software similarity problem extended to
similarity search over a database.
Find nearest neighbours (by distance) of a
query.
Or find neighbours within a distance of the
query.
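Both kinds of query can be sketched with a brute-force scan, shown here as a baseline. The database layout (a dict from sample name to vector), the sample names and the choice of Manhattan distance are assumptions made for illustration.

```python
def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def nearest_neighbours(query, database, k, dist=manhattan):
    """Return the k closest entries as (distance, name), closest first."""
    ranked = sorted((dist(query, vec), name) for name, vec in database.items())
    return ranked[:k]


def range_search(query, database, radius, dist=manhattan):
    """Return the names of all entries within the given distance of the query."""
    return [name for name, vec in database.items() if dist(query, vec) <= radius]


# Hypothetical database of birthmark vectors.
database = {"sample_a": [0, 0], "sample_b": [1, 1], "sample_c": [5, 5]}

print(nearest_neighbours([0, 0], database, 2))
print(range_search([0, 0], database, 2))
```

This scan computes one distance per database entry for every query, which is the cost an index is meant to avoid.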
15. [Figure: a query q with search radius r; d(p,q) is the distance from q to a malware p in the database. Queries are labelled Benign or Malicious.]
16. Vector distances here are “metric”.
They have the mathematical properties of a
metric.
This means you can do a nearest neighbour
search without brute forcing the entire
database!
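One way the metric property helps is through the triangle inequality: if the distance from every database entry to a fixed pivot is stored in advance, many entries can be ruled out without computing their distance to the query. The sketch below shows this idea with a single pivot. The class name, the pivot choice and the Manhattan distance are assumptions, and this is not necessarily the index structure used in the system described here.

```python
def manhattan(u, v):
    """Sum of absolute coordinate differences (a metric)."""
    return sum(abs(a - b) for a, b in zip(u, v))


class PivotIndex:
    """Range search that prunes candidates using one pivot."""

    def __init__(self, vectors, dist=manhattan):
        self.vectors = vectors
        self.dist = dist
        self.pivot = vectors[0]  # arbitrary choice for this sketch
        # Precomputed once, when the database is built.
        self.to_pivot = [dist(self.pivot, v) for v in vectors]

    def range_search(self, query, radius):
        """Return (hits, number of distance computations performed)."""
        d_qp = self.dist(query, self.pivot)
        hits, computed = [], 1
        for vec, d_pv in zip(self.vectors, self.to_pivot):
            # Triangle inequality: d(query, vec) >= |d_qp - d_pv|,
            # so entries failing this check cannot be within the radius.
            if abs(d_qp - d_pv) > radius:
                continue
            computed += 1
            if self.dist(query, vec) <= radius:
                hits.append(vec)
        return hits, computed


index = PivotIndex([[0, 0], [1, 0], [10, 10], [20, 20]])
hits, computed = index.range_search([1, 1], 2)
print(hits, computed)
```

In this example the two distant entries are skipped, so fewer distances are computed than there are entries in the database.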
17. System is 100,000 lines of C++ code.
The modules for this work are fewer than 3,000 lines of code.
System translates x86 into an intermediate
language (IL).
Performs analysis on architecture independent IL.
Unpacks malware using an application level
emulator.
18. Database of 10,000 malware samples.
Scanned 1,601 benign binaries.
10 false positives. Less than 1%.
Using additional refinement algorithm,
reduced to 7 false positives.
Very small binaries have small signatures and
cause weak matching.
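The false positive rates behind these counts can be checked with a short calculation; the function name is an assumption, and the figures are the ones on the slide.

```python
def false_positive_rate(false_positives, benign_total):
    """Fraction of benign binaries wrongly flagged as malware."""
    return false_positives / benign_total


print(f"{false_positive_rate(10, 1601):.2%}")  # 0.62%
print(f"{false_positive_rate(7, 1601):.2%}")   # 0.44%
```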
19. Calculated similarity between Roron malware
variants.
Compared results to Ruxcon 2010 work.
In tables, highlighted cells indicate a positive
match.
The more matches the more effective it is.
20. Exact Matching (Ruxcon 2010)
     ao    b     d     e     g     k     m     q     a
ao   -     0.44  0.28  0.27  0.28  0.55  0.44  0.44  0.47
b    0.44  -     0.27  0.27  0.27  0.51  1.00  1.00  0.58
d    0.28  0.27  -     0.48  0.56  0.27  0.27  0.27  0.27
e    0.27  0.27  0.48  -     0.59  0.27  0.27  0.27  0.27
g    0.28  0.27  0.56  0.59  -     0.27  0.27  0.27  0.27
k    0.55  0.51  0.27  0.27  0.27  -     0.51  0.51  0.75
m    0.44  1.00  0.27  0.27  0.27  0.51  -     1.00  0.58
q    0.44  1.00  0.27  0.27  0.27  0.51  1.00  -     0.58
a    0.47  0.58  0.27  0.27  0.27  0.75  0.58  0.58  -
Heuristic Approximate Matching (Ruxcon 2010)
     ao    b     d     e     g     k     m     q     a
ao   -     0.70  0.28  0.28  0.27  0.75  0.70  0.70  0.75
b    0.74  -     0.31  0.34  0.33  0.82  1.00  1.00  0.87
d    0.28  0.29  -     0.50  0.74  0.29  0.29  0.29  0.29
e    0.31  0.34  0.50  -     0.64  0.32  0.34  0.34  0.33
g    0.27  0.33  0.74  0.64  -     0.29  0.33  0.33  0.30
k    0.75  0.82  0.29  0.30  0.29  -     0.82  0.82  0.96
m    0.74  1.00  0.31  0.34  0.33  0.82  -     1.00  0.87
q    0.74  1.00  0.31  0.34  0.33  0.82  1.00  -     0.87
a    0.75  0.87  0.30  0.31  0.30  0.96  0.87  0.87  -
Q-Grams
     ao    b     d     e     g     k     m     q     a
ao   -     0.86  0.53  0.64  0.59  0.86  0.86  0.86  0.86
b    0.88  -     0.66  0.76  0.71  0.97  1.00  1.00  0.97
d    0.65  0.72  -     0.88  0.93  0.73  0.72  0.72  0.73
e    0.72  0.80  0.87  -     0.93  0.80  0.80  0.80  0.80
g    0.69  0.77  0.93  0.93  -     0.77  0.77  0.77  0.77
k    0.88  0.97  0.67  0.77  0.72  -     0.97  0.97  0.99
m    0.88  1.00  0.66  0.76  0.71  0.97  -     1.00  0.97
q    0.88  1.00  0.66  0.76  0.71  0.97  1.00  -     0.97
a    0.87  0.97  0.67  0.77  0.72  0.99  0.97  0.97  -
21. Faster than Ruxcon 2010.
Median benign processing time is 0.06s.
Median malware processing time is 0.84s.
Slowest result may be due to memory thrashing.
% Samples   Benign Time(s)   Malware Time(s)
10          0.02             0.16
20          0.02             0.28
30          0.03             0.30
40          0.03             0.36
50          0.06             0.84
60          0.09             0.94
70          0.13             0.97
80          0.25             1.03
90          0.56             1.31
100         8.06             585.16
22. Improved effectiveness and efficiency compared to
Ruxcon 2010.
Runs in real-time in expected case.
Large functional code base and years of development
time.
Happy to talk to vendors.
23. Full academic paper at IEEE Trustcom.
Research page http://www.foocodechu.com
Book on “Software similarity and classification”
available in 2012.
Wiki on software similarity and classification
http://www.foocodechu.com/wiki