A Scalable Approach for Malware Detection through Bounded Feature Space Behavior Modeling
1. A Scalable Approach for Malware Detection through Bounded Feature Space Behavior Modeling
Mahinthan Chandramohan, Tan Hee Beng Kuan, Lionel Briand, Shar Lwin Khin, and Bindu Madhavi Padmanabhuni
Interdisciplinary Centre for ICT Security, Reliability, and Trust, University of Luxembourg, Luxembourg
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
2. What is malware?
Malware (malicious + software) is software that performs malicious actions without the victim’s knowledge.
3. Motivation
• More than 403 million new malware variants were created in 2011, a 41% increase over 2010.
• On average, around 55,000 new malware samples were reported per day.
• The exponential growth of malware is a major threat to the software industry.
4. Problem Definition 1/2
• New malware has become very sophisticated.
• Malware evades traditional anti-virus signatures using various obfuscation techniques.
• Malware authors change the syntactic characteristics (i.e., structure) of a malicious program without changing its semantics (i.e., behavior).
5. Problem Definition 2/2
• Scalability is a major problem in existing behavior-based malware detection techniques:
  • The malware feature space grows in proportion to the number of samples under examination.
  • The techniques are computationally very intensive.
6. Related Work 1/2
• The practicality and efficiency of behavior-based malware detection depend on:
  • size of the feature space,
  • computational complexity,
  • overheads (e.g., pre-processing),
  • detection accuracy.
• Simple malware behavior models (e.g., n-gram, m-bag, and k-tuple) generate huge feature spaces and require various pruning and parameter-tuning mechanisms.
7. Related Work 2/2
• Complex malware behavior models (e.g., system call dependency graphs) are computationally very intensive.
8. Behavior Modeling – An Overview
• Software programs perform actions on various operating system resources.
• An action corresponds to a higher-level operation (e.g., reading a file) composed of a set of related system calls (e.g., NtReadFile).
• The advantage of using actions over system calls is that the OS may use different names for system calls that in fact serve the same purpose.
• For example, NtCreateProcess and NtCreateProcessEx both map to the CreateProcess action.
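The mapping from system calls to actions can be sketched as a simple lookup table. This is a minimal illustration, not the authors' implementation: only the two CreateProcess entries are stated on the slide, the ReadFile entry is an assumption based on the NtReadFile example, and the names `SYSCALL_TO_ACTION` and `to_actions` are hypothetical.

```python
# Hypothetical lookup table. The two CreateProcess entries come from the slide;
# pairing NtReadFile with a "ReadFile" action is an assumption.
SYSCALL_TO_ACTION = {
    "NtCreateProcess": "CreateProcess",
    "NtCreateProcessEx": "CreateProcess",
    "NtReadFile": "ReadFile",
}


def to_actions(syscalls):
    """Translate a system-call trace into actions, skipping unmapped calls."""
    return [SYSCALL_TO_ACTION[s] for s in syscalls if s in SYSCALL_TO_ACTION]


trace = ["NtCreateProcess", "NtReadFile", "NtCreateProcessEx", "NtUnknownCall"]
actions = to_actions(trace)
print(actions)
```

Both process-creation calls collapse to the same action name, which is the point of modeling at the action level.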
9. Operating System Resource Types
• File System
• Registry
• Process and Thread
• Network
• Synchronization
• Section
10. Bounded Feature Space Behavior Modeling (BOFM)
Malware feature: For each type of OS resource, the set of actions performed by the malware on an instance of that OS resource type constitutes a feature of the malware.
• Example: A malware sample performs the CreateFile and DeleteFile actions on a file instance C:\foo.exe, and the DeleteFile action on another file instance C:\abc.dll. This malware has two features, {CreateFile, DeleteFile} and {DeleteFile}, with respect to the file resource instances C:\foo.exe and C:\abc.dll, respectively.
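The example above can be reproduced with a short sketch that groups observed actions by resource instance. The event format (resource type, instance, action) and the function name `features_per_instance` are assumptions made for illustration; the slides do not specify how traces are represented.

```python
from collections import defaultdict


def features_per_instance(events):
    """Group actions by (resource type, instance) into one action set each."""
    grouped = defaultdict(set)
    for resource_type, instance, action in events:
        grouped[(resource_type, instance)].add(action)
    return {key: frozenset(actions) for key, actions in grouped.items()}


# The file example from the slide, written as an assumed event list.
events = [
    ("file", r"C:\foo.exe", "CreateFile"),
    ("file", r"C:\foo.exe", "DeleteFile"),
    ("file", r"C:\abc.dll", "DeleteFile"),
]
per_instance = features_per_instance(events)
for (rtype, instance), feature in per_instance.items():
    print(rtype, instance, sorted(feature))
```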
11. Properties of BOFM features 1/2
• Goal: To be more resilient to commonly used obfuscation techniques.
• Property 1: Regardless of the number of times an action is performed on an OS resource instance, it is counted only once in the final feature set. E.g., if the ReadFile action is performed several times on a file instance C:\Windows\...\sysfile2.dll, this behavior is modeled by the BOFM feature {ReadFile}.
• Property 2: The order in which the malware performs the actions is ignored in feature construction. E.g., the malware features {ReadFile, QueryFileInformation} and {QueryFileInformation, ReadFile} are considered identical.
12. Properties of BOFM features 2/2
• Property 3: Identical action sets performed on two different OS resource instances of the same type are modeled as a single feature. E.g., the actions CreateFile and DeleteFile performed on two different file resource instances, C:\Windows\abc.dll and D:\Personel\foo.exe, are modeled as a single BOFM feature {CreateFile, DeleteFile}.
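Taken together, the three properties mean that repeating actions, reordering them, or renaming the target resource leaves the feature set unchanged. The sketch below illustrates this under an assumed event format of (resource type, instance, action); `bofm_features` is a hypothetical helper, not the authors' code.

```python
from collections import defaultdict


def bofm_features(events):
    """Return the set of (resource type, action set) features for one sample."""
    grouped = defaultdict(set)
    for resource_type, instance, action in events:
        # A set ignores repetition (Property 1) and ordering (Property 2).
        grouped[(resource_type, instance)].add(action)
    # Dropping the instance name merges identical action sets (Property 3).
    return {(rtype, frozenset(acts)) for (rtype, _), acts in grouped.items()}


original = [
    ("file", r"C:\Windows\abc.dll", "CreateFile"),
    ("file", r"C:\Windows\abc.dll", "DeleteFile"),
]
# Same behavior with reordered and repeated actions on a second file instance.
obfuscated = [
    ("file", r"D:\Personel\foo.exe", "DeleteFile"),
    ("file", r"D:\Personel\foo.exe", "CreateFile"),
    ("file", r"D:\Personel\foo.exe", "CreateFile"),
    ("file", r"C:\Windows\abc.dll", "CreateFile"),
    ("file", r"C:\Windows\abc.dll", "DeleteFile"),
]
same = bofm_features(original) == bofm_features(obfuscated)
print(same)
```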
13. Bounded Feature Space
• Goal: Avoid malware feature space growth proportional to the number of samples under examination.
• Let j be an OS resource type.
• The total number k_j of possible actions that malware may perform on an OS resource instance of type j is a constant.
• The maximum number m_j of possible features for OS resource type j is therefore also a constant.
• The maximum number N of possible features over all resource types is therefore always a constant.
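The equations on this slide did not survive extraction. A plausible reconstruction, assuming (as Properties 1 to 3 imply) that a feature for resource type j is any non-empty subset of its k_j possible actions, is given below. The symbol R for the set of OS resource types is introduced here for illustration only.

```latex
m_j = \sum_{i=1}^{k_j} \binom{k_j}{i} = 2^{k_j} - 1,
\qquad
N = \sum_{j \in R} m_j = \sum_{j \in R} \left( 2^{k_j} - 1 \right)
```

For instance, a resource type with k_j = 3 actions admits at most 2^3 - 1 = 7 features, no matter how many samples are analyzed.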
14. OS Resource Types and Corresponding Actions
The total number of malware features (N) extracted from these six OS resources is 16,652.
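Because the feature space is bounded, every feature can be given a fixed position in a vector of length N. The sketch below shows one way to do this with a bitmask per resource type, assuming a feature is a non-empty subset of that type's actions. The action lists in `ACTIONS` are hypothetical and much shorter than the table on this slide, so the resulting N is 10, not 16,652.

```python
def feature_space_size(actions_by_type):
    """N = sum over resource types of (2^k_j - 1) non-empty action subsets."""
    return sum(2 ** len(actions) - 1 for actions in actions_by_type.values())


def feature_index(actions_by_type, resource_type, action_set):
    """Map a non-empty action set to a fixed position in [0, N)."""
    offset = 0
    for rtype, actions in actions_by_type.items():
        if rtype == resource_type:
            # One bit per action; list.index raises ValueError for unknown actions.
            mask = sum(1 << actions.index(a) for a in set(action_set))
            if mask == 0:
                raise ValueError("a feature must contain at least one action")
            return offset + mask - 1
        offset += 2 ** len(actions) - 1
    raise KeyError(resource_type)


# Hypothetical, shortened action lists (not the table from the slide).
ACTIONS = {
    "file": ["CreateFile", "DeleteFile", "ReadFile"],
    "registry": ["DeleteKey", "NotifyChangeKey"],
}
n = feature_space_size(ACTIONS)
idx_file = feature_index(ACTIONS, "file", {"CreateFile", "DeleteFile"})
idx_reg = feature_index(ACTIONS, "registry", {"NotifyChangeKey"})
print(n, idx_file, idx_reg)
```

A sample is then a binary vector of length N with a 1 at the index of each feature it exhibits, which is the form most classifiers expect.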
16. Detection Method
• Machine learning (ML) classification techniques are used to build malware detection models.
• Logistic Regression (LR) and Support Vector Machine (SVM) are used in our experiments.
• The malware detection process involves two phases:
  • Phase 1: model building
  • Phase 2: model evaluation
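The two phases can be illustrated with a toy logistic regression written in plain Python. This is a stand-in for illustration only: the slides do not say which LR or SVM implementation was used, and the four-feature vectors, learning rate, and epoch count below are all assumptions.

```python
import math


def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Phase 1 (model building): fit weights and bias by batch gradient descent."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(model, x):
    """Return 1 (malware) if the linear score is non-negative, else 0 (benign)."""
    w, b = model
    return 1 if b + sum(wj * xj for wj, xj in zip(w, x)) >= 0 else 0


# Toy binary feature vectors (1 = feature present); labels: 1 = malware, 0 = benign.
X_train = [[1, 0, 1, 0], [1, 0, 0, 1], [0, 1, 1, 0], [0, 1, 0, 1]]
y_train = [1, 1, 0, 0]
model = train_logistic_regression(X_train, y_train)

# Phase 2 (model evaluation): score unseen samples and measure accuracy.
X_test = [[1, 0, 1, 1], [0, 1, 0, 0]]
y_test = [1, 0]
predictions = [predict(model, x) for x in X_test]
accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
print(predictions, accuracy)
```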
17. Experimental Dataset
• Training set of 5000 malware and 80 benign samples, and a test set of 300 malware and 20 benign samples.
18. Experimental Results
• SVM achieved 99.4% detection accuracy with no false positives, and LR achieved 99.6% detection accuracy with a 1% FP rate.
• Balanced test sets consist of 20 malware samples randomly selected from the pool of 300, plus the 20 benign samples.
• On the balanced test sets, SVM yielded a perfect accuracy of 100% with a 0% FP rate, and LR achieved 99.5% detection accuracy with a 1% FP rate.
19. Comparison with Canali et al. (ISSTA 2012)
• Both achieve 99% detection accuracy.
• However:
  • BOFM generated only 569 active features, whereas Canali et al. generated several million.
  • Extracting malware features took 1.67 hours with BOFM, versus around 48 hours for Canali et al.
  • Training the SVM classifier took 26 seconds and consumed only 200 MB of RAM, whereas Canali et al.'s approach consumed more than 1 GB of RAM to perform signature matching.
  • BOFM is much more efficient and scalable.
20. Conclusion
• Malware evades traditional anti-virus signatures using various obfuscation techniques.
• Behavior-based malware detection is an increasingly common solution.
• Scalability is a major problem in existing behavior-based malware detection techniques.
• We proposed a bounded feature space behavior modeling (BOFM) technique to address the scalability issue.
• BOFM entails a fixed number of features that does not grow in proportion to the number of malware samples under examination.
• Benchmark: BOFM combined with SVM achieved 100% detection accuracy in less than a minute, using 200 MB of memory.
21. Feature Space Analysis
• Comparison of malware and benign feature spaces.
• The 57% share of unique malware features suggests that BOFM is a promising technique for modeling malware behavior.
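A share of this kind can be computed with a set difference, assuming "unique" means a feature that appears in malware samples but never in benign ones (the slide does not define the term). The feature sets and the helper `unique_share` below are hypothetical and do not reproduce the 57% figure.

```python
def unique_share(malware_features, benign_features):
    """Fraction of malware features never observed in benign samples."""
    if not malware_features:
        return 0.0
    unique = malware_features - benign_features
    return len(unique) / len(malware_features)


# Hypothetical feature sets, each feature being (resource type, action set).
malware = {
    ("file", frozenset({"CreateFile", "DeleteFile"})),
    ("registry", frozenset({"NotifyChangeKey"})),
    ("file", frozenset({"OpenFile"})),
    ("sync", frozenset({"CreateMutex"})),
}
benign = {
    ("file", frozenset({"OpenFile"})),
    ("sync", frozenset({"CreateMutex"})),
}
share = unique_share(malware, benign)
print(f"{share:.0%} of malware features are unique to malware")
```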
22. Brief Analysis of Interesting Features
• The ‘NotifyChangeKey’ action is used far more widely by malware samples than by benign samples (86% vs. 15%).
• The ‘DeleteKey’ action is also widely used by malware samples.
• Actions such as ‘OpenFile’, ‘GetFileAttributes’, ‘CreateMutex’, and ‘ReleaseMutex’ appear widely in both malware and benign samples.