Talk outline
- About us
- About DNA
- .NET disassembler
- Clustering
- IDA plugin
About us
We analyse files on a daily basis to determine whether they are
malicious, and that includes Windows 8 apps and Windows Phone apps.
For the past few years we have been involved in fields such as
bioinformatics, molecular biology and genetics, which has allowed us to
take some of the ideas and algorithms used in the bio field
and apply them to malware classification and detection.
About DNA
- DNA is made of four chemical building blocks called
nucleotides: adenine (A), thymine (T), cytosine (C) and
guanine (G).
- A three-nucleotide series (called a codon) in a DNA
sequence specifies a single amino acid.
- The DNA sequences are translated to amino acids that
produce proteins.
- Each DNA sequence that contains instructions to make a
protein is known as a gene.
moleculesoflife2010.wikispaces.com/Protein+Structure
About DNA sequence variation
The human genome comprises about 3 billion base pairs of DNA.
Due to various factors, mutations occur so the DNA sequence may change.
Single nucleotide polymorphisms, frequently called SNPs (pronounced “snips”), are
the most common type of genetic variation among people.
Each SNP represents a difference in a single DNA building block.
They can act as biological markers, helping scientists locate genes that are
associated with disease.
About GWAS
A genome-wide association study (GWAS) is an approach used in genetics research to
associate specific genetic variations with particular diseases.
The method involves scanning the genomes (1 million SNPs) from many different
people (healthy and carriers) and looking for genetic markers that can be used to
predict the presence of a disease.
The results of a GWAS are often displayed in a scatter plot (called a Manhattan plot),
in which the peaks indicate regions of the genome associated with that disease.
Figure: Manhattan plot showing the −log10 P values of
606,164 SNPs in a GWAS of 1,472 Japanese
atopic dermatitis cases
and 7,971 controls, plotted against their
respective positions on the autosomes and the X
chromosome. Atopic dermatitis, also known as atopic eczema,
is a non-contagious itchy skin disorder.
Source: www.nature.com/ng/journal/v44/n11/fig_tab/ng.2438_F1.html
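The transform behind a Manhattan plot can be sketched in a few lines of Python. This is an illustration only: the SNP identifiers and P values are invented, the function names are hypothetical, and the 5e-8 cutoff is a widely used genome-wide significance convention rather than a value taken from the slides.

```python
import math

# Conventional genome-wide significance cutoff (an assumption, not from the talk).
GENOME_WIDE_SIGNIFICANCE = 5e-8

def manhattan_points(pvalues):
    """Map {snp_id: P} to {snp_id: -log10(P)}, the y axis of a Manhattan plot."""
    return {snp: -math.log10(p) for snp, p in pvalues.items()}

def peaks(pvalues, cutoff=GENOME_WIDE_SIGNIFICANCE):
    """Return the SNP ids whose P value falls below the cutoff."""
    return sorted(snp for snp, p in pvalues.items() if p < cutoff)

# Made-up example data.
example = {"rs1": 0.5, "rs2": 1e-9, "rs3": 1e-3}
```

Small P values become tall points after the transform, which is why associated regions show up as peaks.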
The DNA code is read three letters at a time
(these DNA triplets are called codons)
Most of the codons correspond to a specific
amino acid. However some of the 64 codons
code for the same amino acid.
Also three of the codons are used as 'stop'
signals (STOP codon) and another is the
'start' signal (START codon).
This resembles the way a disassembler
works: the binary machine code corresponds to the
DNA sequence and the assembly instructions correspond to
the amino acids.
DNA sequence: CCCTGTGGAGCCACACCCTAG
Codons:       CCC TGT GGA GCC ACA CCC TAG

Codon  Amino acid   CIL (MSIL) bytes  Instruction
CCC    Proline      288B00000A        call
TGT    Cysteine     03                ldarg.1
GGA    Glycine      7D52000004        stfld
GCC    Alanine      02                ldarg.0
ACA    Threonine    04                ldarg.2
CCC    Proline      288B00000A        call
TAG    STOP         2A                ret
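The analogy can be sketched in Python: one helper splits a DNA string into codons, the other walks a byte string and decodes CIL instructions until it reaches ret. The opcode table covers only the six opcodes in the example, the function names are hypothetical, and this is not the disassembler described in the talk.

```python
# Operand sizes (in bytes) for the few CIL opcodes used in the example.
# A real disassembler needs the full table, including 0xFE-prefixed opcodes.
OPCODES = {
    0x02: ("ldarg.0", 0),
    0x03: ("ldarg.1", 0),
    0x04: ("ldarg.2", 0),
    0x28: ("call", 4),   # followed by a 4-byte metadata token
    0x2A: ("ret", 0),
    0x7D: ("stfld", 4),  # followed by a 4-byte metadata token
}

def codons(dna):
    """Split a DNA string into three-letter codons."""
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]

def decode(code):
    """Decode CIL bytes into (hex bytes, mnemonic) pairs, stopping at ret."""
    out = []
    pos = 0
    while pos < len(code):
        name, operand_size = OPCODES[code[pos]]
        raw = code[pos:pos + 1 + operand_size]
        out.append((raw.hex().upper(), name))
        pos += 1 + operand_size
        if name == "ret":
            break
    return out

SAMPLE = bytes.fromhex("".join(
    ["288B00000A", "03", "7D52000004", "02", "04", "288B00000A", "2A"]))
```

Just as the STOP codon ends translation, ret ends the decoding loop.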
The CLR header can be reached from the IMAGE_DATA_DIRECTORY structure.
Then we have access to the offset to the MetaData header that holds the number
of streams.
Immediately after, we have the headers for each stream contained inside the file.
typedef struct CLR_HEADER
{
    DWORD SizeOfStructure;
    WORD MajorRuntimeVersion;
    WORD MinorRuntimeVersion;
    IMAGE_DATA_DIRECTORY MetaData;   /* RVA and size of the metadata */
    …
} CLR_HEADER;

typedef struct METADATA_HEADER
{
    …
    WORD NoOfStreams;                /* 16-bit stream count */
    …
} METADATA_HEADER;

typedef struct STREAM_HEADERSR
{
    DWORD Offset;
    DWORD Size;
    unsigned char * Name;
    …
} STREAM_HEADERSR;
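A minimal Python sketch of reading these structures from raw bytes, assuming the layout given in ECMA-335 (a "BSJB" signature, a length-prefixed version string, then a 16-bit stream count followed by the stream headers). The function names are hypothetical, and locating the CLR header inside the PE file is left out.

```python
import struct

METADATA_SIGNATURE = 0x424A5342  # "BSJB"

def parse_clr_header(data):
    """Read the first fields of the CLR header; returns the metadata RVA and size."""
    _size, _major, _minor, md_rva, md_size = struct.unpack_from("<IHHII", data, 0)
    return md_rva, md_size

def parse_stream_headers(md):
    """Return [(name, offset, size), ...] from a metadata root blob."""
    sig, _major, _minor, _reserved, ver_len = struct.unpack_from("<IHHII", md, 0)
    if sig != METADATA_SIGNATURE:
        raise ValueError("not a metadata root")
    pos = 16 + ver_len                      # skip the (already padded) version string
    _flags, n_streams = struct.unpack_from("<HH", md, pos)
    pos += 4
    streams = []
    for _ in range(n_streams):
        offset, size = struct.unpack_from("<II", md, pos)
        pos += 8
        end = md.index(b"\x00", pos)        # names are null-terminated ASCII
        name = md[pos:end].decode("ascii")
        pos = (end + 1 + 3) & ~3            # and padded to a 4-byte boundary
        streams.append((name, offset, size))
    return streams
```

The entry named `#~` in the returned list gives the offset and size of the metadata tables stream.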
We are interested in #~ (the metadata stream) because it contains the
information about the methods.
- The #~ table header contains a bitmask-QWORD that tells us the tables
present in this stream. (For example we can have the TypeRef, TypeDef,
MethodDef, Field, etc. tables). Out of all, we are interested in the MethodDef
table because it contains the RVAs of the method bodies.
- Following the #~ header we have a set of DWORDs specifying the number of
rows for each table that is present.
- After them we have the actual Metadata tables.
- The RVA within the MethodDef table tells us where the body of the method
can be found.
typedef struct TABLE_HEADER
{
    DWORD Reserved;
    BYTE MajorVersion;
    BYTE MinorVersion;
    …
    QWORD ValidMask;                 /* one bit per table present */
    …
} TABLE_HEADER;

typedef struct TABLE_METHODDEF
{
    DWORD RVA;                       /* location of the method body */
    WORD ImplFlags;
    WORD Flags;
    WORD NameIndex;
    …
} TABLE_METHODDEF;
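The bitmask and row counts can be read as follows. This is a sketch that assumes the ECMA-335 layout of the #~ header (24 bytes, then one DWORD per table present) and the standard table number 0x06 for MethodDef; the function name is hypothetical. Finding where the MethodDef rows start also requires the row sizes of the preceding tables, which is not shown.

```python
import struct

METHODDEF = 0x06  # table number of MethodDef in ECMA-335

def read_row_counts(tilde):
    """Return {table_number: row_count} for every table flagged in the valid mask."""
    (_reserved, _major, _minor, _heap_sizes, _reserved2,
     valid, _sorted) = struct.unpack_from("<IBBBBQQ", tilde, 0)
    counts = {}
    pos = 24                                  # row counts follow the 24-byte header
    for table in range(64):
        if (valid >> table) & 1:
            (counts[table],) = struct.unpack_from("<I", tilde, pos)
            pos += 4
    return counts
```

`read_row_counts(stream).get(METHODDEF, 0)` then gives the number of methods defined in the file.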
For each method, the RVA is the offset to the first instruction.
The Common Intermediate Language (CIL, formerly MSIL) uses a
variable-length instruction encoding in which 1 or 2 bytes
represent the opcode.
We disassemble from the first instruction until we reach RET (opcode
0x2A in CIL).
All the instructions are split into basic blocks and we pick only
the first operand (FOP).
We have a set of rules that filter out garbage instructions.
We then compute a CRC over the list of FOPs and add it to the database.
CIL (MSIL) bytes  FOP
288B00000A        call
03                ldarg.1
7D52000004        stfld
02                ldarg.0
04                ldarg.2
288B00000A        call
2A                ret
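Hashing such a list could look like the sketch below. The slides do not say which CRC variant or serialisation is used, so CRC-32 from `zlib` and newline-joined mnemonics are assumptions, and `fop_crc` is a hypothetical name.

```python
import zlib

def fop_crc(fops):
    """Return a 32-bit CRC over a list of FOP strings (assumed serialisation)."""
    data = "\n".join(fops).encode("ascii")
    return zlib.crc32(data) & 0xFFFFFFFF

FOPS = ["call", "ldarg.1", "stfld", "ldarg.0", "ldarg.2", "call", "ret"]
crc_id = fop_crc(FOPS)
```

Two methods with the same FOP sequence get the same CRC even when their operands (tokens, offsets) differ, which is what makes the value usable as a clustering feature.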
Clustering
Clustering - basics
Feature set:
- CRCIDs representing the hashes of
the FOPs present in a given file
- Double[ ] file1 = [1, 32, 5673, 5674,
5675, 18001, …, 18607];
Distance measure:
- Jaccard index: size of the intersection
divided by the size of the union of the two
sets.
- Derivative we use: size of the smaller of
the two sets divided by the size of
the union.
- This gives a similarity value between 0
and 1; subtracting it from 1 gives us
a distance measure.
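Both measures can be written directly on Python sets. The sketch below follows the wording of the slide for the derivative (smaller set size over union size); the function names are hypothetical, and treating two empty sets as identical is an assumption.

```python
def jaccard(a, b):
    """Classic Jaccard index: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def derivative_similarity(a, b):
    """Derivative described on the slide: min(|A|, |B|) / |A ∪ B|."""
    union = a | b
    return min(len(a), len(b)) / len(union) if union else 1.0

def distance(a, b):
    """Distance = 1 - similarity."""
    return 1.0 - derivative_similarity(a, b)

file1 = {1, 2, 3}
file2 = {2, 3, 4}
```

For `file1` and `file2` above the Jaccard index is 2/4 while the derivative is 3/4, so the derivative is the more lenient of the two.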
Assume 0.01 s on average per distance computation.
A simplistic implementation would have a complexity of O(n²):
- Computing the distance for every possible pair of files
- For example, imagine having to cluster 1500 files:
1500² × 0.01 s = 22,500 s (6.25 hours)
Clearly this doesn't scale well.
Our mitigation techniques to improve speed:
Loading all the files in memory and ordering them by the number of FOPs they
contain.
Only computing the distance when the size ratio is within the threshold value, which is possible
due to the properties of our distance function.
Use of prototypes for agglomerative clustering:
- In each cluster, the smallest file is elected as the “prototype” to represent that
cluster.
- When doing agglomerative clustering, new files are compared to the prototype of each
cluster until we find a distance within the threshold; otherwise the
file is put in a new cluster.
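The three techniques fit together roughly as in the sketch below. It assumes the threshold is a maximum distance between 0 and 1 and reuses the slide's distance definition; `cluster` and its arguments are hypothetical names, not the authors' implementation.

```python
def distance(a, b):
    """1 - min(|A|, |B|) / |A ∪ B|, as described for the talk's measure."""
    union = a | b
    return 1.0 - min(len(a), len(b)) / len(union) if union else 0.0

def cluster(files, threshold):
    """files: {name: set of CRC ids}. Returns a list of clusters (lists of names)."""
    prototypes = []   # feature set of the smallest file in each cluster
    clusters = []     # clusters[i][0] is the prototype's name
    # Ordering by size means the first file in a cluster is also its smallest.
    for name, feats in sorted(files.items(), key=lambda kv: len(kv[1])):
        for i, proto in enumerate(prototypes):
            # The union is at least as large as the bigger set, so the distance
            # is at least 1 - |proto| / |feats|; skip pairs that cannot match.
            if feats and 1.0 - len(proto) / len(feats) > threshold:
                continue
            if distance(proto, feats) <= threshold:
                clusters[i].append(name)
                break
        else:
            prototypes.append(feats)
            clusters.append([name])
    return clusters

demo = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3, 4, 5},
    "c": set(range(10, 20)),
}
```

Each new file is compared only with one prototype per cluster, and many of those comparisons are skipped by the size check, which is where the speed-up comes from.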
[Animation: clustering with a threshold of 30%, shown on six files labelled 90, 35, 88, 87, 40 and 92. Frames omitted.]
Clustering speed (threshold of 80%)
Number of files to cluster    Time taken to complete (seconds)
312                           6.655
1000                          81.644
1500                          759.799
4604                          945.557
7380                          1941.852

Clustering speed (threshold of 20%)
Number of files to cluster    Time taken to complete (seconds)
840                           3.058
1500                          14
7380                          35.475
The time taken to cluster the same 1500 files from the previous example is now
drastically improved and follows the threshold value:
- With the simplistic approach: 22,500 s
- With mitigation techniques and a threshold of 80%: 760 s
- With mitigation techniques and a threshold of 20%: 14 s
Viewing the clustered data
We need:
- a file from the database that we know is malicious (we’ve selected
Pameseg/ArchSMS)
- a loose cluster that the file is part of (we’ve selected a cluster that had 399
files)
Algorithm:
- for each CRC present in the target file, we extract the number of files where
that CRC is present
- calculate the median of those counts and remove everything above it, based on the
assumption that the most prevalent CRCs are clean (they are also found in
clean files). After this step we were left with 285 files.
- use the following formula to get the CRCs that are most probably
malicious.
k – total number of CRCs
Nfi – number of files containing a specific CRC
p – the default p-value (0.05)
Di – distance of the specific CRC
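The first two steps of the algorithm (counting and the median cut) can be sketched as below. The scoring formula itself is not reproduced here because it does not appear in the text; `rare_crcs` and the toy database are hypothetical.

```python
from statistics import median

def rare_crcs(target, database):
    """Keep the CRCs of `target` whose file count is at or below the median.

    target:   set of CRC ids found in the known-malicious file
    database: {file_name: set of CRC ids}
    """
    counts = {crc: sum(crc in feats for feats in database.values())
              for crc in target}
    if not counts:
        return {}
    cutoff = median(counts.values())
    # Highly prevalent CRCs are assumed to be clean and are dropped.
    return {crc: n for crc, n in counts.items() if n <= cutoff}

toy_db = {"f1": {1, 2, 3}, "f2": {1, 2}, "f3": {1}, "f4": {1, 4}}
```

With the toy data, CRC 1 appears in all four files and is discarded, while CRCs 2 and 3 survive as candidates.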
- Using the data set from
gettinggeneticsdone.blogspot.com/2011/04/annotated-manhattan-plots-and-qq-plots.html
(200,000 SNPs) and applying the same approach we get:
Applying the formula to our example dataset of 285 files (those left after we
applied the median), we got a result similar to that of the GWAS data.
We took the first two CRCs and ran a query for each one in order to see which
files contain them. The result was a set of 10 files, all of which were found to be
malicious and from the same family (Pameseg/ArchSMS).
IDA Python Plugin
Much as geneticists analyse
genetic variants and identify their link to various diseases, we
have implemented an approach that helps us to
automatically identify malicious files.
The IDA plugin shows the areas of the code that
require more attention. This will reduce the time needed for
manual analysis.
We can extend the clustering algorithm to other
features like instructions, behaviour data, etc.
In the future we plan to extend the approach to other
types of files and other platforms.
Will this method be effective with packed files?
Will this method be effective with obfuscated .NET files?
Does the plugin improve analysis time?
Can the CRCs be used as part of generic detections / family
classification?
What is the effect of the speed mitigation strategies and of using a
derivative of the Jaccard index?
Other questions, thoughts, etc.
From DNA Sequence Variation to .NET Bits and Bobs

From DNA Sequence Variation to .NET Bits and Bobs

  • 1.
  • 3.
  • 4. About us We analyse files on a daily basis to determine if they are malicious and that includes Windows 8 Apps and Windows Phone apps. For the past few years we have been involved in fields like bioinformatics, molecular biology and genetics allowing us to extrapolate some of the ideas/algorithms used in the bio field and apply them to malware classification and detection purposes. About us About DNA .NET disassembler Clustering IDA plugin
  • 5.
  • 6. About DNA - DNA is made of four chemical building blocks called nucleotides: adenine (A), thymine (T), cytosine (C) and guanine (G). - A three-nucleotide series (called codon) in a DNA sequence specifies a single amino acid. - The DNA sequences are translated to amino acids that produce proteins. - Each DNA sequence that contains instructions to make a protein is known as a gene. About us About DNA .NET disassembler Clustering IDA plugin moleculesoflife2010.wikispaces.com/Protein+Structure
  • 7. About DNA sequence variation The human genome comprises about 3 billion base pairs of DNA. Due to various factors, mutations occur so the DNA sequence may change. Single nucleotide polymorphisms, frequently called SNPs (pronounced “snips”), are the most common type of genetic variation among people. Each SNP represents a difference in a single DNA building block. They can act as biological markers, helping scientists locate genes that are associated with disease.. About us About DNA .NET disassembler Clustering IDA plugin
  • 8. About GWAS A genome-wide association study (GWAS) is an approach used in genetics research to associate specific genetic variations with particular diseases. The method involves scanning the genomes (1 million SNPs) from many different people (healthy and carriers) and looking for genetic markers that can be used to predict the presence of a disease. The results of a GWAS are often displayed in a scatter plot (called a Manhattan plot), in which the peaks indicate regions of the genome associated with that disease. About us About DNA .NET disassembler Clustering IDA plugin Manhattan plot showing the −log10 P values of 606,164 SNPs in the GWAS for 1,472 Japanese atopic dermatitis (also known as atopic eczema, is a non-contagious itchy skin disorder) cases and 7,971 controls plotted against their respective positions on autosomes and the X chromosome www.nature.com/ng/journal/v44/n11/fig_tab/ng.2438_F1.html
  • 9. The DNA code is read three letters at a time (these DNA triplets are called codons) Most of the codons correspond to a specific amino acid. However some of the 64 codons code for the same amino acid. Also three of the codons are used as 'stop' signals (STOP codon) and another is the 'start' signal (START codon). This resembles the way a disassembler works. Here the binary machine code is the DNA sequence and the assembly code are the amino acids. About us About DNA .NET disassembler Clustering IDA plugin CCCTGTGGAGCCACACCCTAG CCC TGT GGA GCC ACA CCC TAG Amino acids CIL(MSIL) instructions CCC - Proline 288B00000A call TGT - Cysteine 03 ldarg.1 GGA – Glycine 7D52000004 stfld GCC - Alanine 02 ldarg.0 ACA – Threonine 04 ldarg.2 CCC - Proline 288B00000A call TAG -STOP 2A ret
  • 10.
  • 11. The CLR header can be reached from the IMAGE_DATA_DIRECTORY structure. Then we have access to the offset to the MetaData header that holds the number of streams. Immediately after, we have the headers for each stream contained inside the file. About us About DNA .NET disassembler Clustering IDA plugin typedef struct CLR_HEADER { DWORD SizeOfStructure; WORD MajorRuntimeVersion; WORD MinorRuntimeVersion; IMAGE_DATA_DIRECTORY MetaData; ….. typedef struct METADATA_HEADER { … IMAGE_DATA_DIRECTORY NoOfStreams; ….. typedef struct STREAM_HEADERSR { DWORD Offset; DWORD Size; unsigned char * Name; …..
  • 12. We are interested in #~ (the metadata stream) because it contains the information about the methods. - The #~ table header contains a bitmask-QWORD that tells us the tables present in this stream. (For example we can have the TypeRef, TypeDef, MethodDef, Field, etc. tables). Out of all, we are interested in the MethodDef table because it contains the RVAs of the method bodies. - Following the #~ header we have a set of DWORDs specifying the number of rows for each table that is present. - After them we have the actual Metadata tables. - The RVA within the MethodDef table tells us where the body of the method can be found. About us About DNA .NET disassembler Clustering IDA plugin typedef struct TABLE_HEADER { DWORD Reserved; WORD MajorVersion; WORD MinorVersion; … QWORD ValidMask; ….. typedef struct TABLE_METHODDEF { DWORD RVA; WORD ImplFlags; WORD Flags; WORD NameIndex; …..
  • 13. For each method the RVA is the offset to the first instruction. The Common Intermediate Language (CIL), formerly MSIL, instructions are encoded using a variable-length instruction encoding, where 1 or 2 bytes are used to represent the instruction. We continue to disassemble from the first instruction until we reach RET (opcode 0x2A in CIL). All the instructions are split into basic blocks and we pick only the first operand (FOP). We have a set of rules that will filter out garbage instructions. We then do a CRC on the list of FOPs and add it in the database. About us About DNA .NET disassembler Clustering IDA plugin CIL(MSIL) FOPs 288B00000A call 03 ldarg.1 7D52000004 stfld 02 ldarg.0 04 ldarg.2 288B00000A call 2A ret
  • 15. Clustering - basics Feature set: - CRCIDs representing the hashes of each FOPS present in a given file - Double[ ] file1 = [1, 32, 5673, 5674, 5675, 18001, …, 18607]; Distance measure: - Jaccard index: size of intersection divided by the size of the union of two sets. - Derivate we use: size of smallest of the two sets divided by the size of the union. - Gives a similarity value between 0 and 1, subtracting that to 1 gives us a distance measure. About us About DNA .NET disassembler Clustering IDA plugin
  • 16. Assume 0.01s on average per distance computation A simplistic implementation would give a complexity of O(n2) - Computing the distance for every possible pair of files - For example, imagine having to cluster 1500 files: (1500) 2 * 0.01 = 22500s (6.25 hours) Clearly doesn’t scale well About us About DNA .NET disassembler Clustering IDA plugin
  • 17. Our mitigation techniques to improve speed: Loading all the files in memory and ordering them by amount of FOPs they contain. Only compute distance when size ratio is within the threshold value, possible due to properties of our distance computation function. Use of prototypes for agglomerative clustering - In each cluster, the smallest file is elected as “prototype” to represent that cluster. - When doing agglomerative clustering, new files to the prototypes of each clusters until we find a distance within the threshold, or alternatively put the file in a new cluster. About us About DNA .NET disassembler Clustering IDA plugin
  • 19–37. Clustering animation – Threshold = 30% (animation frames; the values shown on the files are 90, 35, 88, 87, 40 and 92, and one frame is annotated “35 above threshold!”).
  • 38. Clustering speed (number of files to cluster: time taken to complete, in seconds). Threshold of 80%: 312 files: 6.655; 1000 files: 81.644; 1500 files: 759.799; 4604 files: 945.557; 7380 files: 1941.852. Threshold of 20%: 840 files: 3.058; 1500 files: 14; 7380 files: 35.475.
  • 39. The time taken to cluster the same 1500 files from the previous example is now drastically improved and follows the threshold value: - With the simplistic approach: 22500s - With mitigation techniques and a threshold of 80%: 760s - With mitigation techniques and a threshold of 20%: 14s
  • 41. We need: - a file from the database that we know is malicious (we’ve selected Pameseg/ArchSMS) - a loose cluster that the file is part of (we’ve selected a cluster that had 399 files). Algorithm: - for each CRC present in the target file, we extract the number of files in which that CRC is present - calculate the median and remove everything above it, based on the assumption that the most prevalent CRCs are clean (they are also found in clean files). After this step we got 285 files. - use the following formula to get the CRCs that are most probably malicious, where: k – total number of CRCs; Nfi – number of files containing a specific CRC; p – the default p-value (0.05); Di – distance of the specific CRC.
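The first two steps of the algorithm (counting how many files contain each CRC, then discarding everything above the median) can be sketched in Python as follows. The scoring formula itself is not reproduced in this transcript, so it is left out. The function name `rare_crcs`, the `files_by_crc` mapping and the sample data are hypothetical.

```python
from statistics import median


def rare_crcs(target_crcs, files_by_crc):
    """Keep the target file's CRCs whose prevalence is not above the median.

    target_crcs  -- CRCs present in the known-malicious target file
    files_by_crc -- mapping CRC -> set of file ids containing it (assumed shape)
    """
    counts = {crc: len(files_by_crc.get(crc, ())) for crc in target_crcs}
    if not counts:
        return {}
    cutoff = median(counts.values())
    # Assumption from the slide: the most prevalent CRCs are shared with clean files.
    return {crc: n for crc, n in counts.items() if n <= cutoff}


# Hypothetical toy data: five CRCs from the target file and the files containing them.
files_by_crc = {
    10: {"f1", "f2", "f3", "f4", "f5"},
    11: {"f1", "f2"},
    12: {"f1"},
    13: {"f1", "f2", "f3", "f4"},
    14: {"f1", "f2", "f3"},
}
print(rare_crcs([10, 11, 12, 13, 14], files_by_crc))
```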
  • 42. - Using the data set from gettinggeneticsdone.blogspot.com/2011/04/annotated-manhattan-plots-and-qq-plots.html (200,000 SNPs) and applying the same approach we get:
  • 43. Applying the formula to our example dataset of 285 files (what was left after we applied the median), we got a result similar to the one from the GWAS data. We took the first two CRCs and ran a query for each one in order to see which files contain them. The result was a set of 10 files, all of which were found to be malicious and from the same family (Pameseg/ArchSMS).
  • 51. Geneticists analyse genetic variants to identify their link to various diseases; we have implemented a similar approach to help us automatically identify malicious files.
  • 52. The IDA plugin shows the areas of the code that require more attention. This will reduce the time needed for manual analysis. We can extend the clustering algorithm to other features like instructions, behaviour data, etc. In the future we plan to extend the approach to other types of files and other platforms.
  • 53. Will this method be effective with packed files? Will this method be effective with obfuscated .NET files? Does the plugin improve analysis time? Can the CRCs be used as part of generic detections / family classification? What is the effect of the speed mitigation strategies and of using a derivative of the Jaccard index? Other questions, thoughts, etc…

Editor's notes

  1. First we have to talk a bit about how DNA is organised. As you may already know, DNA is made of 4 bases (A – adenine, T – thymine, C – cytosine and G – guanine). A series of three of these bases (a codon) specifies an amino acid. There must be a precise order or sequence in the DNA, because this is used to make proteins that are responsible for most of the things that happen in our bodies. The DNA sequence that codes for a protein is known as a gene.
  2. As a whole, the human genome comprises about 3 billion base pairs. I’ve said base pairs because DNA has two complementary strands. Each nucleotide can pair up only with its complement (Adenine can be paired only with Thymine and Cytosine only with Guanine). Now, everything is nice when things are working properly, but due to various factors mutations appear and the DNA sequence changes; it’s dynamic. Today we will discuss the single nucleotide polymorphism (aka SNP). It’s the most common type of genetic variation and is represented by a change in a single nucleotide, for example Adenine changing to Cytosine. These kinds of changes may actually be involved in various diseases, because now the sequence that codes for a particular protein has changed and so the gene function is changed. Researchers are using these SNPs as biological markers to locate genes that are associated with disease.
  3. One way scientists search for bio markers is by using the genome-wide association study (aka GWAS) approach. They scan hundreds of human genomes from both healthy people (aka controls) and carriers, looking for SNPs that may predict the presence of a disease. The results are then displayed in a scatter plot, called a Manhattan plot because it resembles Manhattan’s skyline (kind of). The peaks indicate the SNPs that are most likely associated with a specific disease. For example, in this graph 600k SNPs were tested for more than 1400 Japanese patients suffering from atopic dermatitis (an itchy skin disorder) and almost 8000 controls. The peaks indicate SNPs that are associated with that disease.
  4. Now we know why the DNA sequence is important and that any change in the nucleotides can modify the instructions that code for protein. In a way it’s similar to having a sequence of opcodes where any change can modify some instruction and the result can be a different behaviour of a program. Based on this similarity we built a .NET disassembler and developed a clustering algorithm that allows us to identify malicious code.
  5. We are going to have a quick overview of the .NET file structure and see how we actually get to the code. From the IMAGE_DATA_DIRECTORY of a .NET file we get to the CLR_HEADER that holds a similar type of structure for the MetaData. There we can get the number of streams, and immediately after that we have the structure for each stream. The STREAM_HEADER structure contains the name, offset and size for each stream.
  6. Out of all streams we are interested in the metadata stream. This stream contains the Method Definition table that can get us to the code for each method. Each Method Definition structure that follows the metadata stream contains a relative virtual address (RVA) which points to the first instruction of each method defined in the executable.
  7. Using the Common Intermediate Language specs we developed the disassembler. It starts at the beginning of each method and continues to disassemble the current method until it reaches the RET instruction. Afterwards we split everything into basic blocks (a set of instructions that contains one entry point and one exit point, so the instructions are executed exactly once, in order). From each instruction in a basic block we extract only the first operand and, once we have all of them, we compute a CRC that is later added to a database.
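The disassembly loop described in this note can be sketched in Python on the example bytes from slide 13. This is only an illustration under several assumptions: the opcode table is a hypothetical six-entry subset of CIL, the "FOP" of an instruction is assumed to be its leading opcode byte (as the slide's example column suggests), CRC-32 from `zlib` stands in for whatever CRC the authors use, and basic-block splitting is omitted because the example contains no branches.

```python
import zlib

# Hypothetical, deliberately tiny opcode table: opcode byte -> (mnemonic, operand size in bytes).
OPCODES = {
    0x02: ("ldarg.0", 0),
    0x03: ("ldarg.1", 0),
    0x04: ("ldarg.2", 0),
    0x28: ("call", 4),
    0x2A: ("ret", 0),
    0x7D: ("stfld", 4),
}
RET = 0x2A


def disassemble(code):
    """Decode instructions from the first byte until RET is reached."""
    instructions = []
    offset = 0
    while offset < len(code):
        opcode = code[offset]
        if opcode not in OPCODES:
            raise ValueError(f"opcode 0x{opcode:02X} is not in the sketch's table")
        mnemonic, operand_size = OPCODES[opcode]
        operand = code[offset + 1 : offset + 1 + operand_size]
        instructions.append((opcode, mnemonic, operand))
        offset += 1 + operand_size
        if opcode == RET:
            break
    return instructions


def fops_crc(instructions):
    """Return (length, CRC-32) of the FOP sequence, assumed here to be the opcode bytes."""
    fops = bytes(opcode for opcode, _, _ in instructions)
    return len(fops), zlib.crc32(fops)


# Method body taken from the example on slide 13.
method_body = bytes.fromhex("288B00000A037D520000040204288B00000A2A")
decoded = disassemble(method_body)
print([mnemonic for _, mnemonic, _ in decoded])
print(fops_crc(decoded))
```

The length and CRC pair returned by `fops_crc` plays the role of the key that the notes describe storing in the database.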
  8. There are 2 things to consider when we want to do clustering: the features to cluster on and the distance measure to use. In our case, in line with what Andrei just explained, we used the CRCID of each FOPS in our database. Basically all the FOPS are linked to their CRC hash / length in our database and therefore all have a unique CRCID. We build a list of CRCIDs for each file, so it can be represented as an array like the one we see here. This intentionally keeps the feature representation generic, so that we could use any other kind of feature to test different clustering approaches without having to change the clustering engine itself. As for the distance measure, we used a derivative of the Jaccard index. Instead of dividing the size of the intersection by the size of the union of two sets, we divide the size of the smaller set by the size of the union of both sets. This gives us a similarity value between 0 and 1, which we then subtract from 1 to get a distance measure. You can see the basic function there with a simple implementation in C#.
  9. The speed of the getDistance operation will of course vary depending on the number of FOPS in each file, meaning the number of CRCIDs in each array. Based on various tests, we can assume an average speed of 0.01s per distance computation. A simplistic implementation of clustering, where we calculate the distance between every pair of files and then run a clustering algorithm on this NxN matrix of distances, would give a complexity of n-squared. For example, if we have to cluster 1500 files we would get 1500² × 0.01 = 22500 seconds, or 6.25 hours. That clearly doesn’t scale well and as a result we get some very bored analysts.
  10. Luckily, we don’t need to calculate the distance for each pair of files. We implemented a few mitigation techniques that help us improve the speed considerably. First, we load all the files in memory and order them by the number of FOPs they contain. By loading the files, I mean loading the arrays of CRCIDs associated with each file. Then, we compute the distance between two files only if their size ratio is within the threshold value. For example, if we want clusters with a maximum distance of 30% (0.30), the larger of the two files can be at most 30% bigger than the smaller one for us to bother computing the distance between them. This is possible because the file size ratio is directly used in the distance measure, so we know in advance that a file size ratio bigger than the threshold will never return a distance within the threshold. Also, we support agglomerative clustering, meaning we can periodically add new files to the previously known clusters or create new ones if needed. This could be daily, hourly or every time a new file enters our system. To avoid having to reset the clusters each time, we use prototypes to represent each cluster and we only compare the new file to the prototypes of previously created clusters. We define the prototype of a cluster as the smallest file it contains (the one with the fewest FOPs/CRCIDs). You can see it as the smallest common denominator.
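A minimal Python sketch of the prototype idea from this note: a new file is compared only against each cluster's prototype, and the distance is skipped when the size ratio alone rules out a match. The cluster layout, the function names and the exact pruning test are assumptions. The test used here, 1 - small/large > threshold, is the bound that follows from the distance definition, whereas the note phrases the cut-off as the larger file being at most 30% bigger.

```python
def get_distance(a, b):
    """1 - (size of the smaller set / size of the union), for two non-empty sets."""
    if not a or not b:
        return 1.0  # assumption: empty feature sets match nothing
    return 1.0 - min(len(a), len(b)) / len(a | b)


def assign_to_cluster(new_file, clusters, threshold):
    """Add a new file to the first cluster whose prototype is close enough.

    clusters -- list of dicts {"prototype": set, "members": [set, ...]} (assumed layout)
    """
    new_set = set(new_file)
    for cluster in clusters:
        proto = cluster["prototype"]
        small, large = sorted((len(new_set), len(proto)))
        # Size-ratio pruning: the distance can never be below 1 - small/large.
        if small == 0 or 1.0 - small / large > threshold:
            continue
        if get_distance(new_set, proto) <= threshold:
            cluster["members"].append(new_set)
            if len(new_set) < len(proto):
                cluster["prototype"] = new_set  # the smallest file represents the cluster
            return cluster
    new_cluster = {"prototype": new_set, "members": [new_set]}
    clusters.append(new_cluster)
    return new_cluster


# Hypothetical files entering the system one at a time.
clusters = []
assign_to_cluster(range(1, 11), clusters, 0.30)  # 10 CRCIDs, starts the first cluster
assign_to_cluster(range(1, 12), clusters, 0.30)  # 11 CRCIDs, close to the first prototype
assign_to_cluster(range(1, 6), clusters, 0.30)   # 5 CRCIDs, pruned by size ratio
assign_to_cluster(range(1, 10), clusters, 0.30)  # 9 CRCIDs, becomes the new prototype
print(len(clusters), [len(c["members"]) for c in clusters])
```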
  11. Here’s a basic overview of the clustering algorithm: We first load all the file IDs and their respective CRCIDs in a dictionary. We then sort that dictionary by the number of CRCIDs each file contains, smallest file first. This operation is very quick and is the stepping stone for most of the time savings made later. We then take the first file (File1) in the dictionary and put it in a new cluster. - We loop over every subsequent file until the size ratio gets greater than the threshold value and calculate its distance to the selected file. - If we obtain a distance within the threshold value for a valid file (File2), we put it in the same cluster as File1 and remove it from the dictionary of files to cluster. Once we reach the end of the possible matches for File1, we remove it from the dictionary and go back to the third step until all files have been checked.
  12. Now here’s this algorithm explained with an animation. Let’s say we want to cluster these 8 files with a distance threshold of 30% (or 0.30). The first step is to order them by size (the number of FOPs each file contains).
  13. Then we can start clustering.
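The steps in this overview can be sketched in Python as follows. This is a sketch under assumptions rather than the authors' engine: the names are hypothetical, the input is a plain dictionary of file id to CRCIDs, and the early exit uses the bound 1 - small/large > threshold that follows from the distance definition.

```python
def get_distance(a, b):
    """1 - (size of the smaller set / size of the union), for two non-empty sets."""
    if not a or not b:
        return 1.0  # assumption: empty feature sets match nothing
    return 1.0 - min(len(a), len(b)) / len(a | b)


def cluster_files(files, threshold):
    """Cluster files given as {file_id: iterable of CRCIDs}; returns lists of file ids."""
    sets = {fid: set(crcs) for fid, crcs in files.items()}
    # Sort once, smallest file first; everything else relies on this order.
    pending = sorted(sets, key=lambda fid: len(sets[fid]))
    clusters = []
    while pending:
        seed = pending.pop(0)  # first remaining file starts a new cluster
        small = len(sets[seed])
        cluster, remaining = [seed], []
        for index, fid in enumerate(pending):
            large = len(sets[fid])
            if large and 1.0 - small / large > threshold:
                # Files are sorted by size, so no later file can match the seed either.
                remaining.extend(pending[index:])
                break
            if get_distance(sets[seed], sets[fid]) <= threshold:
                cluster.append(fid)    # joins the seed's cluster and leaves the queue
            else:
                remaining.append(fid)  # stays available for later seeds
        clusters.append(cluster)
        pending = remaining
    return clusters


# Hypothetical input: five files described by their CRCIDs.
files = {
    "a": range(1, 11),
    "b": range(1, 12),
    "c": range(100, 105),
    "d": range(100, 106),
    "e": range(200, 240),
}
print(cluster_files(files, 0.30))
```

In this sketch the seed of each cluster is its smallest member, which matches the notes' definition of a cluster prototype.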
  14. We ran a few tests to see the impact of the threshold value on the overall speed of clustering. We can see here that with a “loose” threshold of 80%, the speed curve still looks exponential but is leaning towards a more linear form. The same 1500 files from the first example now take 760 seconds to cluster. Then, with a more realistic threshold of 20%, we can see a massive improvement in terms of speed, which is now almost linear. The set of 1500 files now takes 14 seconds to cluster. This slide demonstrates well the impact of the mitigation techniques we implemented to speed things up.
  15. Just to recap on the initial example: clustering the set of 1500 files originally took 22500 seconds with the simplistic approach in n-squared complexity. With the mitigation techniques and a threshold value of 80% it takes 760 seconds to cluster, whereas it only takes 14 seconds with a threshold of 20%.
  16. If you remember, in the first part of the presentation we talked about how scientists are using the Manhattan plot to identify the SNPs or bio markers used to predict a disease. Based on that approach and using the data that we’ve collected through our clustering algorithm, we’ve devised a similar technique. Here is an example of how, starting from a known malicious file that is part of a cluster of a few hundred files, we managed to identify only the ones that are malicious and, even more, how we identified the code that is unique to the malicious files. First we got the list of CRCs that are present in the target file, and for each one we counted the number of files that contain the CRC. The second step was to calculate the median and remove everything above it, based on the assumption that the most prevalent CRCs are clean. For each remaining CRC, using a formula similar to a chi-squared distribution with k degrees of freedom, we calculated the probability that the CRC is malicious.
  17. But before we show you the results, here is an example using the same approach on a real set of genetic data, comprised of 200k SNPs. We can easily spot a few peaks that show that they are relevant to the disease.
  18. In our set of 285 files we got a similar result. A few CRCs (which are actually basic blocks) were identified as possibly malicious. In order to verify our result, we’ve queried the database to get the files that contain those CRCs. The query returned another 9 files and after the analysis we saw that they are part of the same malware family. We haven’t stopped here though. In the next slides we will see how we developed a plugin for IDA to help and speed up our analysis.
  19. The idea behind the IDA Python plugin is to use the information generated by the clustering to assist analysis. The plugin does this by calculating the basic blocks and then the FOPs of these basic blocks. It then takes this information and polls the database for the determinations on these basic blocks. It uses these determinations to colour each basic block and also adds a comment that gives the breakdown of the determination. Finally, it appends the FOP length and FOP CRC to the comment; these can be used to get the md5 of files containing this basic block by polling the database. You can then use these files for comparative analysis.
  20. These are the basic steps. First pass -> we retrieve the basic blocks and FOPs and calculate the length and CRCs of the FOPs. We poll the database for the comments, and for each CRC we get the determination for all the files that contain it. Second pass -> we recalculate the basic blocks and this time colour and comment each basic block.
  21. If you have looked at the .NET Intermediate Language you will constantly see pushes to and pops from the stack. This is the evaluation stack and it is used to manage state information in .NET, similar to how x86 uses registers. It is best to really think of the CLR as a stack machine. In this vein, .NET uses the maxstack function to create space on the stack for its calculations. The amount of space created is not so relevant, just that there is enough of it. For that reason the maxstack instruction is not that relevant to the program’s functionality, so IDA Pro does not disassemble it. Our native C disassembler did parse it, so we had discrepancies between our IDA plugin output and our native C output. After some investigation we discovered that there is little value in the maxstack instructions with regard to the functionality of the basic block. So we ignore them, and we also ignore .NET “NOP”s for similar reasons. The second major issue came with the time taken to query the database. To improve performance we added a table to give us a single query for each CRC. We believe we can improve the performance further with stored procedures and an ordering of this table, as well as improvements in the Python code itself.