This document summarizes Liwei Ren's presentation on binary similarity algorithms. It discusses three existing algorithms (ssdeep, sdhash, and TLSH) and proposes a new fuzzy hashing scheme called TSFP. TSFP represents a file as a bag of blocks and measures similarity by the overlap of blocks between two files. Indexing files by their TSFP digests is suggested as a way to solve the similarity search and clustering problems. The presentation concludes by inviting questions from the audience.
Binary Similarity Algorithms and Novel Fuzzy Hashing
1. Copyright 2011 Trend Micro Inc.
Binary Similarity : Theory, Algorithms and Tool
Evaluation
Liwei Ren, Ph.D, Trend Micro™
University of Houston-Downtown, Houston, Texas, October, 2015
Agenda
• What is binary similarity?
• Similarity Digesting: 3 Algorithms
• A Mathematical Model
• Tool Evaluation
• A Novel Fuzzy Hashing
• Summary and Further Research
Classification 10/2/2015 2
What Is Binary Similarity?
• Binary similarity is also known as approximate matching.
– What is binary similarity?
• Four use cases specified by a NIST document:
What Is Binary Similarity?
Similarity Digesting : 3 Algorithms
• Similarity digesting (aka fuzzy hashing):
– A class of hash techniques or tools that preserve similarity.
– Typical steps for digest generation:
– Detecting similarity with similarity digesting:
• Three similarity digesting algorithms and tools:
– ssdeep, sdhash & TLSH
Similarity Digesting : 3 Algorithms
• ssdeep
– Two steps for digesting:
– Edit Distance: Levenshtein distance
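The Levenshtein distance used to compare ssdeep digests can be sketched with the standard dynamic program. This is an illustrative implementation, not ssdeep's own code:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance: the minimum number of
    # insertions, deletions and substitutions needed to turn a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]
```

The smaller the distance between two digests, the more similar the underlying files are judged to be.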
Similarity Digesting : 3 Algorithms
• sdhash, by Dr. Vassil Roussev
– Two steps for digesting:
– Edit Distance: Hamming distance
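sdhash compares its Bloom filter bitmaps with the Hamming distance, i.e. the number of differing bits. A minimal sketch (not sdhash's actual code):

```python
def hamming(x: bytes, y: bytes) -> int:
    # Hamming distance between two equal-length bitmaps (e.g. two
    # 256-byte Bloom filters): count the bit positions where they differ.
    if len(x) != len(y):
        raise ValueError("bitmaps must have equal length")
    return sum(bin(a ^ b).count("1") for a, b in zip(x, y))
```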
Similarity Digesting : 3 Algorithms
• TLSH
– Two steps for digesting:
– Edit Distance: a diff-based evaluation function
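The diff-based evaluation function is only named on this slide; as a hedged illustration of the idea (not TLSH's actual scoring tables), a distance can be accumulated over two equal-length digest bodies made of small bucket codes:

```python
def bucket_diff(d1, d2) -> int:
    # Illustrative diff-based distance between two digests, each a
    # sequence of small bucket codes (e.g. 2-bit values 0..3): sum the
    # per-position differences, weighting large jumps more heavily.
    # The real TLSH scoring differs in its exact penalty table.
    total = 0
    for a, b in zip(d1, d2):
        d = abs(a - b)
        total += d if d < 3 else 2 * d  # extra penalty for opposite extremes
    return total
```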
A Mathematical Model
• Summary of Three Similarity Digesting Schemes:
– Using a first model to describe a binary string with selected features:
• ssdeep model: a string is a sequence of chunks (split from the string).
• sdhash model: a string is a bag of 64-byte blocks (selected with entropy
values).
• TLSH model: a string is a bag of triplets (selected from all 5-grams).
– Using a second model to map the selected features into a digest that
preserves similarity to a certain degree.
• ssdeep model: a sequence of chunks is mapped into an 80-byte digest.
• sdhash model: a bag of blocks is mapped into one or multiple 256-byte
bloom filter bitmaps.
• TLSH model: a bag of triplets is mapped into a 32-byte container.
A Mathematical Model
• Three approaches for similarity evaluation:
• The 1st model plays a critical role in similarity comparison.
• Let us focus on discussing various 1st models today, based on a unified
format.
• The 2nd model saves space but further reduces accuracy.
A Mathematical Model
• Unified format for 1st model:
– A string is described as a collection of tokens (aka features)
organized by a data structure:
• ssdeep: a sequence of chunks.
• sdhash: a bag of 64-byte blocks with high entropy values.
• TLSH: a bag of selected triplets.
– Two types of data structures: sequence, bag.
– Three types of tokens: chunks, blocks, triplets.
• Analogical comparison:
A Mathematical Model
• Four general types of tokens from binary strings:
– k-grams, where k is as small as 3, 4, …
– k-subsequences: any subsequence of length k. The triplet in TLSH
is an example.
– Chunks: the whole string is split into non-overlapping chunks.
– Blocks: selected substrings of fixed length.
• This yields eight different models (two data structures × four token types)
to describe a string for similarity.
• Analogical thinking:
– We define different distances to describe a metric space.
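A minimal sketch of two of these token extractors (illustrative names and parameters, not taken from any of the tools):

```python
def kgrams(data: bytes, k: int = 4):
    # All overlapping k-grams of a byte string: fine granularity,
    # good for surviving fragmentation.
    return [data[i:i + k] for i in range(len(data) - k + 1)]

def fixed_blocks(data: bytes, size: int = 64):
    # Fixed-length, non-overlapping blocks: coarse granularity. A real
    # scheme such as sdhash keeps only blocks selected by an entropy
    # filter rather than keeping all of them.
    return [data[i:i + size] for i in range(0, len(data), size)]
```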
Tool Evaluation
• Data Structure:
– Bag: a bag ignores the order of tokens. It is good at handling content
swapping.
– Sequence: a sequence organizes tokens in an order. This is weak for handling
content swapping.
• Tokens:
– k-grams: due to the small k (3, 4, 5, …), this fine granularity is good at
handling fragmentation.
– k-subsequences: due to the small k (3, 4, 5, …), this fine granularity is good
at handling fragmentation.
– Chunks: this approach takes account of every byte at raw granularity. It
should be OK at handling containment and cross sharing.
– Blocks: depending on the selection function, a block model does not take
account of every byte, but it may represent a string more efficiently, which
is good for generating similarity digests. Due to the fixed block length,
it is good at handling containment and cross sharing.
Tool Evaluation
Tool          | Model  | Minor Changes | Containment | Cross Sharing | Swap   | Fragmentation
ssdeep        | M1.3   | High          | Medium      | Medium        | Medium | Low
sdhash        | M2.4   | High          | High        | High          | High   | Low
TLSH          | M2.2   | High          | Low         | Medium        | High   | High
sdhash + TLSH | Hybrid | High          | High        | High          | High   | High
Tool Evaluation
• Note: vulnerability is outside the scope of this evaluation, but is worth
mentioning.
• My co-worker Dr. Jon Oliver shows in one of his papers:
– Both ssdeep & sdhash are vulnerable to adversarial attacks.
– TLSH is not!
A Novel Fuzzy Hashing
• We would like to design a novel fuzzy hashing scheme based on model
M2.4:
– A string is represented by a bag of blocks.
– Two steps: (1) feature selection; (2) digest generation.
A Novel Fuzzy Hashing
• Continuing:
A Novel Fuzzy Hashing
• This is TSFP
– Trend String Fingerprint
• Similarity measurement of TSFP:
– Given two TSFPs H and G, where H = h1h2…hn and G = g1g2…gm.
– Similarity is measured by the function:
• SIMH(H,G) = 200*|S ⋂ T| / (|S| + |T|)
– where S = {h1, h2, …, hn} and T = {g1, g2, …, gm}
– 0 ≤ SIMH(H,G) ≤ 100
• Similarity measurement of two strings:
– SIM(s,t) = SIMH(TSFP(s), TSFP(t))
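The SIMH measure above is a Dice-style coefficient scaled to 0..100. A direct sketch in Python, with the token type left abstract:

```python
def simh(H, G) -> float:
    # SIMH(H, G) = 200 * |S ∩ T| / (|S| + |T|), where S and T are the
    # sets of tokens in the two TSFP digests. Identical digests score
    # 100; digests with no common token score 0.
    S, T = set(H), set(G)
    if not S and not T:
        return 100.0  # convention for two empty digests (an assumption here)
    return 200 * len(S & T) / (len(S) + len(T))
```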
A Novel Fuzzy Hashing
• Why do we need TSFP?
• We need to solve the following problems:
1. Similarity search problem:
• B is a bag of binary strings {t1, t2, …, tn}. Given δ > 0 and a binary string s,
find t ϵ {t1, t2, …, tn} such that SIM(s, t) ≥ δ.
2. Similarity-based clustering problem:
• B is a bag of binary strings {t1, t2, …, tn}. Partition B into groups based
on their binary similarity.
• Why not {ssdeep, sdhash & TLSH}?
– With those digests, the obvious solution is a brute-force algorithm.
– NOTE: Jon Oliver uses a random forest to solve the search problem
without brute force. I will try to prove its feasibility mathematically.
A Novel Fuzzy Hashing
• Similarity search problem:
• B is a bag of binary strings {t1, t2, …, tn}. Given δ > 0 and a binary string
s, find t ϵ {t1, t2, …, tn} such that SIM(s, t) ≥ δ.
• How does a keyword-based search engine work?
– Extracting keywords from documents.
– Indexing keywords & documents.
– Searching via keywords.
• Solution:
– Given a string s, we get its fuzzy hash TSFP(s) = h1h2…hn.
– Let S = {h1, h2, …, hn}; each hj is a token of s that we treat as a
keyword, so we can create the index TSFP-Index(B).
– We can then solve the search problem above in two steps.
A Novel Fuzzy Hashing
• Similarity search problem:
• B is a bag of binary strings {t1, t2, …, tn}. Given δ > 0 and a binary string
s, find t ϵ {t1, t2, …, tn} such that SIM(s, t) ≥ δ.
• STEP 1:
– Candidate selection
• Let TSFP(s) = h1h2…hn and create the bag of tokens S = {h1, h2, …, hn}.
• Use this bag of tokens to search the index TSFP-Index(B) to retrieve a
list of candidates {s1, s2, …, sm} ⊂ {t1, t2, …, tn}, ranked by the
number of common tokens.
• STEP 2:
– Brute force at a smaller scale
• For each t ϵ {s1, s2, …, sm}, if SIM(s, t) ≥ δ, then t is what we are
searching for.
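The two steps above can be sketched end to end. Here `tokens` is a toy stand-in for TSFP digest generation (plain 4-grams), since the sketch only needs some bag of tokens per string; all names and parameters are illustrative:

```python
from collections import Counter, defaultdict

def tokens(s: bytes, k: int = 4):
    # Toy stand-in for TSFP: the set of k-grams of s.
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def build_index(B):
    # Inverted index, as in a keyword search engine:
    # token -> set of positions in B whose digest contains that token.
    index = defaultdict(set)
    for i, t in enumerate(B):
        for tok in tokens(t):
            index[tok].add(i)
    return index

def search(s: bytes, B, index, delta: float):
    # STEP 1: candidate selection. Rank the strings of B by the number
    # of tokens they share with s, found via the inverted index.
    hits = Counter()
    for tok in tokens(s):
        for i in index.get(tok, ()):
            hits[i] += 1
    # STEP 2: brute force at a smaller scale. Run the exact Dice-style
    # similarity (scaled to 0..100) only on the ranked candidates.
    S = tokens(s)
    results = []
    for i, _ in hits.most_common():
        T = tokens(B[i])
        if 200 * len(S & T) / (len(S) + len(T)) >= delta:
            results.append(B[i])
    return results
```

Strings sharing no token with s are never touched in STEP 2, which is what makes the index pay off over a brute-force scan of all of B.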
Summary and Further Research
• My practice of academic research in industry:
Summary and Further Research
Framework of approximate matching, searching and clustering:
Q&A
• Thank you for your interest.
• Any questions?
• My Information:
– Email: liwei_ren@trendmicro.com
– Academic Page: https://pitt.academia.edu/LiweiRen