From DNA Sequence Variation to .NET Bits and Bobs

About us
We analyse files on a daily basis to determine whether they are
malicious, including Windows 8 apps and Windows Phone apps.
For the past few years we have also been involved in fields such as
bioinformatics, molecular biology and genetics, which allows us to
take some of the ideas and algorithms used in biology
and apply them to malware classification and detection.
About us
About DNA
.NET disassembler
Clustering
IDA plugin

About DNA
- DNA is made of four chemical building blocks called
nucleotides: adenine (A), thymine (T), cytosine (C) and
guanine (G).
- A series of three nucleotides (called a codon) in a DNA
sequence specifies a single amino acid.
- The DNA sequences are translated to amino acids that
produce proteins.
- Each DNA sequence that contains instructions to make a
protein is known as a gene.
moleculesoflife2010.wikispaces.com/Protein+Structure

About DNA sequence variation
The human genome comprises about 3 billion base pairs of DNA.
Due to various factors, mutations occur so the DNA sequence may change.
Single nucleotide polymorphisms, frequently called SNPs (pronounced “snips”), are
the most common type of genetic variation among people.
Each SNP represents a difference in a single DNA building block.
They can act as biological markers, helping scientists locate genes that are
associated with disease.

About GWAS
A genome-wide association study (GWAS) is an approach used in genetics research to
associate specific genetic variations with particular diseases.
The method involves scanning the genomes (1 million SNPs) from many different
people (healthy and carriers) and looking for genetic markers that can be used to
predict the presence of a disease.
The results of a GWAS are often displayed in a scatter plot (called a Manhattan plot),
in which the peaks indicate regions of the genome associated with that disease.
Manhattan plot showing the −log10 P values of
606,164 SNPs in the GWAS for 1,472 Japanese
atopic dermatitis cases and 7,971 controls,
plotted against their respective positions on
autosomes and the X chromosome.
(Atopic dermatitis, also known as atopic eczema,
is a non-contagious itchy skin disorder.)
www.nature.com/ng/journal/v44/n11/fig_tab/ng.2438_F1.html

The DNA code is read three letters at a time
(these DNA triplets are called codons)
Most of the codons correspond to a specific
amino acid. However some of the 64 codons
code for the same amino acid.
Also three of the codons are used as 'stop'
signals (STOP codon) and another is the
'start' signal (START codon).
This resembles the way a disassembler
works: the binary machine code is the
DNA sequence and the assembly instructions
are the amino acids.
DNA sequence:       CCCTGTGGAGCCACACCCTAG
Split into codons:  CCC TGT GGA GCC ACA CCC TAG
Codon  Amino acid   CIL (MSIL) bytes  Instruction
CCC    Proline      288B00000A        call
TGT    Cysteine     03                ldarg.1
GGA    Glycine      7D52000004        stfld
GCC    Alanine      02                ldarg.0
ACA    Threonine    04                ldarg.2
CCC    Proline      288B00000A        call
TAG    STOP         2A                ret
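The codon-by-codon reading described above can be sketched in a few lines of Python. This is only an illustration: `CODON_TABLE` and `translate` are hypothetical names, and the table holds just the six codons from the example rather than all 64.

```python
# Illustrative subset of the genetic code: only the codons used in the
# example above (the real table has 64 entries).
CODON_TABLE = {
    "CCC": "Proline",
    "TGT": "Cysteine",
    "GGA": "Glycine",
    "GCC": "Alanine",
    "ACA": "Threonine",
    "TAG": "STOP",
}


def translate(dna):
    """Read the sequence three letters at a time until a STOP codon."""
    amino_acids = []
    for i in range(0, len(dna) - 2, 3):
        name = CODON_TABLE.get(dna[i:i + 3], "unknown")
        if name == "STOP":
            break
        amino_acids.append(name)
    return amino_acids


print(translate("CCCTGTGGAGCCACACCCTAG"))
```

A disassembler follows the same pattern, except that the "codon" length varies with the opcode.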

The CLR header can be reached from the IMAGE_DATA_DIRECTORY structure.
Then we have access to the offset to the MetaData header that holds the number
of streams.
Immediately after, we have the headers for each stream contained inside the file.
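typedef struct CLR_HEADER
{
    DWORD SizeOfStructure;
    WORD MajorRuntimeVersion;
    WORD MinorRuntimeVersion;
    IMAGE_DATA_DIRECTORY MetaData;   /* RVA and size of the metadata */
    …
} CLR_HEADER;
typedef struct METADATA_HEADER
{
    …
    WORD NoOfStreams;                /* number of stream headers that follow */
    …
} METADATA_HEADER;
typedef struct STREAM_HEADERS
{
    DWORD Offset;                    /* relative to the start of the metadata header */
    DWORD Size;
    unsigned char * Name;            /* stream name, e.g. "#~" */
    …
} STREAM_HEADERS;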
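As a rough illustration of walking these headers, the sketch below parses the stream headers from a raw metadata blob. It assumes the on-disk layout of the metadata root as laid out in ECMA-335 (a "BSJB" signature, a length-prefixed version string, then a 2-byte stream count); `parse_stream_headers` is a hypothetical helper, and locating the blob inside the PE file via the CLR header is left out.

```python
import struct


def parse_stream_headers(md):
    """Return {stream name: (offset, size)} from a raw metadata root blob."""
    signature, _major, _minor, _reserved, version_len = struct.unpack_from(
        "<IHHII", md, 0
    )
    if signature != 0x424A5342:  # the bytes "BSJB"
        raise ValueError("not a CLI metadata root")
    # Skip the version string; its stored length is already 4-byte padded.
    pos = 16 + version_len
    _flags, stream_count = struct.unpack_from("<HH", md, pos)
    pos += 4
    streams = {}
    for _ in range(stream_count):
        offset, size = struct.unpack_from("<II", md, pos)
        pos += 8
        end = md.index(b"\x00", pos)   # the name is a NUL-terminated ASCII string
        name = md[pos:end].decode("ascii")
        pos = (end + 1 + 3) & ~3       # names are padded to a 4-byte boundary
        streams[name] = (offset, size)
    return streams
```

The returned offsets are relative to the start of the metadata blob, so the `#~` stream would start at `md[offset]`.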

We are interested in #~ (the metadata stream) because it contains the
information about the methods.
- The #~ table header contains a bitmask-QWORD that tells us the tables
present in this stream. (For example we can have the TypeRef, TypeDef,
MethodDef, Field, etc. tables). Out of all, we are interested in the MethodDef
table because it contains the RVAs of the method bodies.
- Following the #~ header we have a set of DWORDs specifying the number of
rows for each table that is present.
- After them we have the actual Metadata tables.
- The RVA within the MethodDef table tells us where the body of the method
can be found.
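typedef struct TABLE_HEADER
{
    DWORD Reserved;
    BYTE MajorVersion;
    BYTE MinorVersion;
    …
    QWORD ValidMask;                 /* bit n set => table n is present */
    …
} TABLE_HEADER;
typedef struct TABLE_METHODDEF
{
    DWORD RVA;                       /* location of the method body */
    WORD ImplFlags;
    WORD Flags;
    WORD NameIndex;                  /* index into #Strings; 4 bytes wide when that heap is large */
    …
} TABLE_METHODDEF;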
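The way the bitmask and the row counts fit together can be sketched as follows. This assumes the `#~` header layout from ECMA-335 (a 24-byte header with the valid mask at offset 8, followed by one DWORD per present table); `read_row_counts` and `TABLE_NAMES` are hypothetical names, and only a few table numbers are listed.

```python
import struct

# A few well-known table numbers; MethodDef is table 0x06.
TABLE_NAMES = {
    0x00: "Module",
    0x01: "TypeRef",
    0x02: "TypeDef",
    0x04: "Field",
    0x06: "MethodDef",
}


def read_row_counts(stream):
    """Return {table number: row count} for the tables flagged in the mask."""
    (_reserved, _major, _minor, _heap_sizes, _reserved2,
     valid, _sorted) = struct.unpack_from("<IBBBBQQ", stream, 0)
    counts = {}
    pos = 24  # the row counts start right after the 24-byte header
    for table in range(64):
        if valid & (1 << table):
            (counts[table],) = struct.unpack_from("<I", stream, pos)
            pos += 4
    return counts
```

With the counts known, `counts.get(0x06, 0)` gives the number of MethodDef rows, which is what the disassembler needs in order to iterate over method bodies.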

For each method the RVA is the offset to the first instruction.
The Common Intermediate Language (CIL), formerly MSIL, uses a
variable-length instruction encoding in which the opcode takes 1 or 2 bytes
and may be followed by an operand.
We disassemble from the first instruction until we reach RET (opcode
0x2A in CIL).
All the instructions are split into basic blocks and we pick only
the first operand (FOP).
We have a set of rules that filter out garbage instructions.
We then compute a CRC over the list of FOPs and add it to the database.
CIL (MSIL) bytes  FOP
288B00000A        call
03                ldarg.1
7D52000004        stfld
02                ldarg.0
04                ldarg.2
288B00000A        call
2A                ret
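A minimal sketch of this step, run on the byte sequence from the example, might look like the following. Several things here are assumptions rather than the authors' implementation: the opcode table covers only the six opcodes shown, the FOP list is taken to be the list of mnemonics, the CRC is a plain CRC-32 from `zlib`, and the basic-block split and garbage-filtering rules are omitted.

```python
import zlib

# Tiny illustrative opcode table: opcode -> (mnemonic, operand size in bytes).
OPCODES = {
    0x02: ("ldarg.0", 0),
    0x03: ("ldarg.1", 0),
    0x04: ("ldarg.2", 0),
    0x28: ("call", 4),    # followed by a 4-byte metadata token
    0x2A: ("ret", 0),
    0x7D: ("stfld", 4),   # followed by a 4-byte metadata token
}


def disassemble(code):
    """Decode mnemonics from the first instruction up to and including ret."""
    mnemonics = []
    pos = 0
    while pos < len(code):
        mnemonic, operand_size = OPCODES[code[pos]]
        mnemonics.append(mnemonic)
        pos += 1 + operand_size
        if mnemonic == "ret":
            break
    return mnemonics


def fop_crc(mnemonics):
    """CRC-32 over the joined list (the exact FOP encoding is an assumption)."""
    return zlib.crc32(" ".join(mnemonics).encode("ascii"))


body = bytes.fromhex("288B00000A037D520000040204288B00000A2A")
fops = disassemble(body)
print(fops, hex(fop_crc(fops)))
```

Because the metadata tokens are skipped, two methods that differ only in which method or field they reference produce the same CRC in this sketch.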

Clustering - basics
Feature set:
- CRC IDs representing the hashes of
the FOP lists present in a given file
- Double[ ] file1 = [1, 32, 5673, 5674,
5675, 18001, …, 18607];
Distance measure:
- Jaccard index: size of the intersection
divided by the size of the union of two
sets.
- Derivative we use: size of the smaller of
the two sets divided by the size of
the union.
- This gives a similarity value between 0
and 1; subtracting it from 1 gives us
a distance measure.
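Both measures are easy to state in code. The sketch below follows the wording above literally; the function names are hypothetical, and treating two empty sets as identical (similarity 1.0) is an assumption made to avoid dividing by zero.

```python
def jaccard(a, b):
    """Classic Jaccard index: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def derived_similarity(a, b):
    """Derivative described above: size of the smaller set / |A ∪ B|."""
    union = a | b
    return min(len(a), len(b)) / len(union) if union else 1.0


def distance(a, b):
    """Distance is one minus the similarity, so it also lies in [0, 1]."""
    return 1.0 - derived_similarity(a, b)


file1 = {1, 32, 5673, 5674}
file2 = {1, 32, 5673, 5674, 5675}
print(jaccard(file1, file2), derived_similarity(file1, file2), distance(file1, file2))
```

When one set is a subset of the other, the two measures agree. As worded, however, the derivative does not fall to 0 for disjoint sets (two disjoint sets of equal size score 0.5), which is worth keeping in mind when choosing a threshold.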

Assume 0.01 s on average per distance computation.
A simplistic implementation would have a complexity of O(n²):
- Computing the distance for every possible pair of files
- For example, imagine having to cluster 1500 files:
1500² × 0.01 s = 22,500 s (6.25 hours)
Clearly, this doesn’t scale well.

Our mitigation techniques to improve speed:
Load all the files in memory and order them by the number of FOPs they
contain.
Only compute the distance when the size ratio of the two files is within the
threshold value; this is possible due to the properties of our distance function.
Use prototypes for agglomerative clustering:
- In each cluster, the smallest file is elected as the “prototype” that represents
the cluster.
- During agglomerative clustering, each new file is compared to the prototypes of
the existing clusters until we find a distance within the threshold; otherwise
the file is put in a new cluster.
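Put together, these techniques could look roughly like the sketch below. It is not the authors' code: `cluster` and `similarity` are hypothetical names, the threshold is assumed to be a maximum distance between 0 and 1, and each file is assumed to be a set of CRC IDs.

```python
def similarity(a, b):
    """Size of the smaller set divided by the size of the union."""
    union = a | b
    return min(len(a), len(b)) / len(union) if union else 1.0


def cluster(files, threshold):
    """files: {name: set of CRC IDs}; threshold: maximum distance (assumed)."""
    clusters = []  # list of (prototype feature set, member names)
    # Process files from smallest to largest, so the first member of each
    # cluster is its smallest file and can serve as the prototype.
    for name, feats in sorted(files.items(), key=lambda item: len(item[1])):
        for proto, members in clusters:
            # Size-ratio pre-filter: the union is at least as large as the
            # bigger set, so similarity can never exceed len(proto) / len(feats).
            if feats and 1.0 - len(proto) / len(feats) > threshold:
                continue
            if 1.0 - similarity(proto, feats) <= threshold:
                members.append(name)
                break
        else:
            # No prototype was close enough: start a new cluster.
            clusters.append((feats, [name]))
    return [members for _, members in clusters]


files = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3, 4, 5},
    "c": set(range(100, 121)),
}
print(cluster(files, 0.3))
```

Each file is compared against one prototype per cluster instead of against every other file, and the size check skips most of those comparisons when the threshold is tight, which is consistent with lower thresholds running faster.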


Clustering animation – Threshold = 30%
[Animation frames showing the items 90, 35, 88, 87, 40 and 92 being clustered; one frame is marked "above threshold!".]

Clustering speed (Threshold of 80%)
Number of files to cluster   Time taken to complete (seconds)
312                          6.655
1000                         81.644
1500                         759.799
4604                         945.557
7380                         1941.852
Clustering speed (Threshold of 20%)
Number of files to cluster   Time taken to complete (seconds)
840                          3.058
1500                         14
7380                         35.475

The time taken to cluster the same 1500 files from the previous example is now
drastically improved and depends on the threshold value:
- With the simplistic approach: 22,500 s
- With mitigation techniques and a threshold of 80%: 760 s
- With mitigation techniques and a threshold of 20%: 14 s

We need:
- a file from the database that we know is malicious (we selected
Pameseg/ArchSMS)
- a loose cluster that the file is part of (we selected a cluster of 399
files)
Algorithm:
- for each CRC present in the target file, we count the number of files in which
that CRC is present
- we calculate the median and remove everything above it, on the
assumption that the most prevalent CRCs are clean (they are also found in
clean files). After this step we were left with 285 files.
- we use the following formula to get the CRCs that are most probably
malicious.
k – total number of CRCs
Nfi – number of files containing a specific CRC
p – the default p-value (0.05)
Di – distance of the specific CRC
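The first two steps of the algorithm (counting and the median cut-off) can be sketched as below. The scoring formula itself is not reproduced here. `rare_crcs` is a hypothetical helper, and reading the median step as a filter on the per-CRC file counts is an interpretation of the description above.

```python
from statistics import median


def rare_crcs(target_crcs, files):
    """Keep the target's CRCs whose file count is at or below the median."""
    # Step 1: for each CRC in the target file, count the files containing it.
    counts = {
        crc: sum(1 for feats in files.values() if crc in feats)
        for crc in target_crcs
    }
    if not counts:
        return {}
    # Step 2: drop everything above the median, on the assumption that the
    # most prevalent CRCs also occur in clean files.
    cutoff = median(counts.values())
    return {crc: n for crc, n in counts.items() if n <= cutoff}


files = {"f1": {1, 2, 3}, "f2": {1, 2}, "f3": {1}, "f4": {1, 4}}
print(rare_crcs({1, 2, 3}, files))
```

In this toy data set, CRC 1 appears in every file and is discarded, while the rarer CRCs 2 and 3 are kept as candidates for scoring.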

- Using the data set from
gettinggeneticsdone.blogspot.com/2011/04/annotated-manhattan-plots-and-qq-plots.html
(200,000 SNPs) and applying the same approach we get:

Applying the formula to our example data set of 285 files (what was left after we
applied the median) gave a result similar to the one for the GWAS data.
We took the first two CRCs and ran a query for each one to see which
files contain them. The result was a set of 10 files, all of which were found to be
malicious and from the same family (Pameseg/ArchSMS).

Just as geneticists analyse genetic variants to identify
their link to various diseases, we have implemented a
comparable approach to help us automatically identify
malicious files.

The IDA plugin shows the areas of the code that
require more attention. This will reduce the time for
manual analysis.
We can extend the clustering algorithm to other
features like instructions, behaviour data, etc.
In the future we plan to extend the approach to other
types of files and other platforms.

Will this method be effective with packed files?
Will this method be effective with obfuscated .NET files?
Does the plugin improve analysis time?
Can the CRCs be used as part of generic detections / family
classification?
What is the effect of the speed mitigation strategies and of using a
derivative of the Jaccard index?
Other questions, thoughts, etc…
