Automated Identification of Cryptographic Algorithms using Dynamic Binary Instrumentation and Fuzzy Hashing (PANDEMONIUM)

PANDEMONIUM:
Automated Identification of Cryptographic Algorithms
using Dynamic Binary Instrumentation and Fuzzy Hashing
Yuma Kurogome
CODE BLUE 2015 [U-25]
2015.10.29
This material is partially based upon work supported by
Asian Office of Aerospace Research and Development,
U.S. Air Force Office of Scientific Research under Award No. FA2386-15-1-4068.

$ whoami
• Yuma Kurogome (@ntddk)
• ntddk.github.io
• Peer review
• Security Camp lecturer
• AVTOKYO speaker

Abstract
• Malware uses many cryptographic algorithms
• To conceal messages and configurations
• DBI (Dynamic Binary Instrumentation)
• Dynamic analysis on PANDA (QEMU)
• Translate x86 code to LLVM IR (intermediate representation) per BB (basic block: one entry, one exit)
• Remove obfuscated code by optimization
• Fuzzy-hash-based pattern matching
• Detect and avoid anti-analysis code
• Identify cryptographic algorithms from the similarity in how received data is handled

Malware and crypto-algorithms
Malware uses many crypto-algorithms to conceal messages and configurations
• Banking trojan
• Decrypt configuration files
• Ransomware
• Encrypt victim files
This research deals with banking trojans
The server (C&C) holds the key
The key is hardcoded in the malware body

Evolution of banking trojan
New malware keeps emerging from the black market
• Many variants were born from the leaked Zeus
• Citadel
• IceIX
• GameOver
• KINS
• New species have also been born
• Dyre
• Vawtrak
• Chthonic
http://www.wontok.com/wp-content/uploads/2014/10/wdt0185_MalwareTimeline_largeV2.jpg

Banking trojan and crypto-algorithms
Many banking trojans use encrypted configuration files and commands
• E.g., communication between Dyre and its C&C server
[Figure: Dyre C&C traffic, labeled "Key + IV" and "Encrypted data"]
We have to identify crypto-algorithms promptly

Related work (1/2)
Identify crypto-algorithms by focusing on arithmetic/bitwise operations
• Dispatcher [CCS’09]
• Finds crypto-routines from the ratio of such instructions between call and ret instructions
• Cannot find crypto-routines that are split across multiple subroutines
• ReFormat [ESORICS’09]
• Finds crypto-routines from the peak in the overall execution log
• Cannot find them when multiple algorithms are implemented
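The instruction-ratio idea behind these approaches can be sketched roughly as follows. This is a minimal illustration, not the actual heuristic of Dispatcher or ReFormat: the mnemonic set, the threshold, and the minimum length are all assumptions chosen for the example.

```python
# Illustrative mnemonic set; an assumption, not the list used by any real tool.
ARITH_BITWISE = {"add", "sub", "mul", "imul", "xor", "and", "or", "not",
                 "shl", "shr", "rol", "ror", "sar"}

def arith_ratio(mnemonics):
    """Fraction of instructions in one routine that are arithmetic/bitwise."""
    if not mnemonics:
        return 0.0
    hits = sum(1 for m in mnemonics if m.lower() in ARITH_BITWISE)
    return hits / len(mnemonics)

def looks_like_crypto(mnemonics, threshold=0.55, min_len=20):
    """Flag a routine that is long enough and dominated by arithmetic.

    threshold and min_len are hypothetical tuning knobs.
    """
    return len(mnemonics) >= min_len and arith_ratio(mnemonics) >= threshold

mixer = ["xor", "rol", "add", "and", "shr"] * 5   # 25 instructions, all arithmetic
glue = ["push", "mov", "call", "mov", "pop"] * 5  # 25 instructions, none arithmetic
print(looks_like_crypto(mixer), looks_like_crypto(glue))
```

A per-routine ratio like this also shows the weakness noted above: if the arithmetic is spread over several small subroutines, no single routine crosses the threshold.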

Related work (2/2)
Identify crypto-algorithms by focusing on loop structures
• Aligot [CCS’11]
• Extracts the inputs of loop structures and feeds them to implementations of known algorithms
• If the output is the same, the algorithm is the same
• The computation cost is high, O(n^2), and it can only identify known crypto-algorithms
• Kerckhoffr [RAID’11]
• Extracts the inputs of loop structures and compares them with signatures of known algorithms
• If a pattern matches, the code is regarded as a crypto-routine
• Can only identify known crypto-algorithms
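The input/output comparison idea can be illustrated with a small sketch. It assumes the input and output bytes of a loop structure have already been captured, and it uses hash functions from the Python standard library as stand-in reference implementations; neither the function names nor the algorithm choice come from the tools above.

```python
import hashlib

# Reference implementations keyed by name. MD5/SHA-1/SHA-256 are used only
# because they ship with the standard library (an illustrative choice).
REFERENCES = {
    "MD5": lambda data: hashlib.md5(data).digest(),
    "SHA-1": lambda data: hashlib.sha1(data).digest(),
    "SHA-256": lambda data: hashlib.sha256(data).digest(),
}

def identify(observed_input, observed_output):
    """Return the reference algorithms that map the input to the output."""
    return [name for name, fn in REFERENCES.items()
            if fn(observed_input) == observed_output]

# Pretend these bytes were captured at the entry and exit of a loop structure.
captured_in = b"config blob"
captured_out = hashlib.md5(captured_in).digest()
print(identify(captured_in, captured_out))
```

The limitation from the slide is visible here: an algorithm that is missing from the reference table can never be identified.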

Downside of related work
Method Known algorithms Unknown algorithms Anti anti-analysis
Dispatcher ☓
ReFormat ☓
Aligot ☓ ☓
Kerckhoffr ☓ ☓
• Previous approaches assume the execution log is infallible
• PANDEMONIUM can analyze malware even if it has anti-analysis routines and has been obfuscated

Anti-analysis
Much malware tries to detect debuggers and sandboxes to avoid analysis
As a result, we often cannot obtain the expected analysis results

There is no silver bullet
Analysis platforms have not been able to keep up with the complex techniques of malware
We need an extensible analysis platform

PANDEMONIUM
Combine different approaches to identify the decrypt-routines of malware
[Figure: stages: avoid anti-analysis, network communication, remove obfuscated code, identify crypto-algorithms]
[Figure: architecture. Labels: PANDA, guest OS, malware, LLVM IR, analysis log, PANDEMONIUM, dynamic analysis, static analysis]

Emulation by QEMU
• TCG (Tiny Code Generator)
1. Disassemble the target code and create BBs (basic blocks) delimited by branch instructions
2. Translate each BB to RISC-like TCG IR
3. Translate TCG IR to host code
4. Chain the translated BBs and execute them
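Step 1 can be sketched as below. This is a simplified illustration, not QEMU code: it works on textual instructions, closes a block after every branch, and ignores branch targets. The branch mnemonic set is an assumed subset.

```python
BRANCHES = {"jmp", "je", "jne", "jz", "jnz", "call", "ret"}  # illustrative subset

def split_basic_blocks(instructions):
    """Group instructions into blocks, closing a block after each branch."""
    blocks, current = [], []
    for insn in instructions:
        current.append(insn)
        mnemonic = insn.split()[0].lower()
        if mnemonic in BRANCHES:
            blocks.append(current)
            current = []
    if current:  # trailing instructions that do not end in a branch
        blocks.append(current)
    return blocks

code = ["push ebp", "mov ebp, esp", "cmp eax, 0x7DF", "je 0xdeadbaad",
        "mov eax, 1", "ret"]
for i, bb in enumerate(split_basic_blocks(code)):
    print(i, bb)
```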

PANDA[REcon’14]
1. Disassemble the target code and create BBs (basic blocks) delimited by branch instructions
2. Translate each BB to RISC-like TCG IR
3. Translate TCG IR to LLVM IR
4. Translate TCG IR to host code
5. Chain the translated BBs and execute them
Example of steps 1 to 3
1. x86:
push esp
push ebp
push ebx
2. TCG IR:
movi_i64 tmp12,$0x8260a634
st_i64 tmp12,env,$0xdae0
ld_i64 tmp12,env,$0xdad0
3. LLVM IR:
%2 = add i64 %env_v, 128
%3 = inttoptr i64 %2 to i64*
store i64 2187372084, i64* %3
• Callbacks before/after translation
• Taint analysis and symbolic execution can be applied
We can obtain the LLVM IR corresponding to the malware code
github.com/moyix/panda

Extract decrypt-routines (1/5)
OS
Malware
Obfuscated code
Anti-analysis routine
Handler to received data
……
Decrypt-routine
Obfuscated code

[Figure: PsActiveProcessHead (Flink/Blink) heads a doubly linked list of EPROCESS structures, chained through ActiveProcessLinks (Flink/Blink); each EPROCESS points to a PEB. Other labels in the figure: FS:[0x30], FS:[0x1c], KPCR, KdVersionBlock, KDEBUGGER_DATA32, PsLoadedModuleList, +0x34, +0x70, +0x78]
An EPROCESS is created when a process is created
Extract the malware process from the running guest OS
(The register differs on Windows 7 or later)
panda/qemu/panda_plugins/osi_winxpsp3x86/osi_winxpsp3x86.cpp
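Walking the process list can be sketched on a toy memory image as follows. This is not the PANDA plugin code: the helper names are hypothetical, guest memory is modeled as a flat bytearray, and the field offsets are assumptions (values commonly cited for 32-bit Windows XP SP3) that should be checked against the actual guest.

```python
import struct

# Assumed EPROCESS field offsets (commonly cited for 32-bit Windows XP SP3).
OFF_PID, OFF_LINKS, OFF_NAME = 0x84, 0x88, 0x174

def read_u32(mem, addr):
    """Read a little-endian 32-bit value from the toy memory image."""
    return struct.unpack_from("<I", mem, addr)[0]

def walk_processes(mem, list_head):
    """Follow Flink pointers from the list head until we return to it."""
    procs = []
    entry = read_u32(mem, list_head)       # head.Flink
    while entry != list_head:
        eproc = entry - OFF_LINKS          # the list entry is embedded in EPROCESS
        raw = bytes(mem[eproc + OFF_NAME:eproc + OFF_NAME + 16])
        name = raw.split(b"\0")[0].decode("ascii")
        procs.append((read_u32(mem, eproc + OFF_PID), name))
        entry = read_u32(mem, entry)       # next Flink
    return procs

# Build a fake "guest memory" with two processes to exercise the walker.
mem = bytearray(0x2000)
HEAD, P1, P2 = 0x10, 0x400, 0x1000

def put_proc(base, pid, name, flink):
    """Write a minimal fake EPROCESS: PID, Flink, and image name."""
    struct.pack_into("<I", mem, base + OFF_PID, pid)
    struct.pack_into("<I", mem, base + OFF_LINKS, flink)
    mem[base + OFF_NAME:base + OFF_NAME + len(name)] = name

struct.pack_into("<I", mem, HEAD, P1 + OFF_LINKS)
put_proc(P1, 4, b"System", P2 + OFF_LINKS)
put_proc(P2, 1337, b"malware.exe", HEAD)
print(walk_processes(mem, HEAD))
```

In a real setup the memory reads would go through the emulator's guest-memory interface instead of a bytearray.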

Extract decrypt-routines (2/5)
Malware
Obfuscated code
……
Decrypt-routine
Obfuscated code

LLVM (1/2)
LLVM optimization passes can remove some obfuscated code
[Figure: x86 code is fed to PANDA, which acts as the frontend and translates it via TCG IR to LLVM IR. Diagram source: llvm.org]

Remove obfuscated code
LLVM optimization passes can remove some obfuscated code
• Inserted dead code and nop-equivalent instructions
• -dse, -simplifycfg
• Substitution with equivalent instructions / instruction reordering
• -constprop
• -instcombine
This also absorbs instruction differences caused by different compiler implementations
(x = 14; y = x + 8) → (x = 14; y = 22)
(y = 3; ...; y = x + 1) → (...; y = x + 1)
(y = x + 2; z = y + 3) → (z = x + 5)
Cf. opticode.coseinc.com
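The first two rewrites above can be reproduced on a toy three-address program. This is only a sketch of the ideas behind constant propagation and dead-store removal, not LLVM's implementation; the tuple-based IR and the function names are invented for the example, and the third rewrite (instruction combining) is not covered.

```python
def fold_constants(program):
    """Forward pass: substitute known constants and fold constant adds."""
    known, out = {}, []
    for dest, expr in program:
        if isinstance(expr, tuple):              # ("add", a, b)
            _, a, b = expr
            a, b = known.get(a, a), known.get(b, b)
            if isinstance(a, int) and isinstance(b, int):
                expr = a + b
            else:
                expr = ("add", a, b)
        else:
            expr = known.get(expr, expr)
        if isinstance(expr, int):
            known[dest] = expr                   # dest now holds a constant
        else:
            known.pop(dest, None)                # dest is no longer constant
        out.append((dest, expr))
    return out

def uses(expr):
    """Variables read by an expression."""
    if isinstance(expr, tuple):
        return {v for v in expr[1:] if isinstance(v, str)}
    return {expr} if isinstance(expr, str) else set()

def drop_dead_stores(program, live_out):
    """Backward pass: drop assignments whose value is never read."""
    live, kept = set(live_out), []
    for dest, expr in reversed(program):
        if dest in live:
            kept.append((dest, expr))
            live.discard(dest)
            live |= uses(expr)
    return kept[::-1]

p1 = [("x", 14), ("y", ("add", "x", 8))]         # x = 14; y = x + 8
p2 = [("y", 3), ("y", ("add", "x", 1))]          # y = 3; y = x + 1
print(drop_dead_stores(fold_constants(p1), {"x", "y"}))
print(drop_dead_stores(fold_constants(p2), {"y"}))
```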

Extract decrypt-routines (3/5)
Malware
……
Decrypt-routine
Obfuscated code

Anti-emulation
We also have to consider anti-emulation

Fuzzy hashing (1/2)
A technique for identifying data that is partially different but similar overall
• ssdeep
• "World leading security researchers will come together for this unique international conference in Tokyo" → Bb7g86hvE/
• "W0rld leading security researchers will come together for this unique international conference in Tokyo" → GT7g86hvE/
Create signatures for anti-analysis routines and crypto-algorithms

Fuzzy hashing (2/2)
A technique for identifying data that is partially different but similar overall
• Create a fuzzy hash per BB
• Normalize operands
• Anti-analysis
• NtDelayExecution(), WaitForSingleObject(), GetCursorPos(),……
• Crypto-algorithms
• MD5, DES, RC4, ……
Create signatures for anti-analysis routines and crypto-algorithms (crypto signatures taken from Beecrypt, Crypto++, and OpenSSL)
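Operand normalization before per-BB comparison can be sketched as below. Several things here are assumptions made for illustration: the example works on x86 text rather than LLVM IR, the register and immediate patterns are a small subset, and difflib's sequence matching stands in for an ssdeep-style fuzzy hash.

```python
import re
from difflib import SequenceMatcher

# Illustrative patterns covering only common 32/16/8-bit general registers.
REG = re.compile(r"\b(e?[abcd]x|e?[sd]i|e?[sb]p|[abcd][lh])\b")
IMM = re.compile(r"\b(0x[0-9a-fA-F]+|\d+)\b")

def normalize(insn):
    """Replace concrete registers and immediates with placeholders."""
    insn = REG.sub("REG", insn.lower())
    return IMM.sub("IMM", insn)

def block_signature(block):
    """Normalized instruction sequence of one basic block."""
    return [normalize(i) for i in block]

def similarity(block_a, block_b):
    """Similarity in 0..1 between two blocks after normalization."""
    return SequenceMatcher(None, block_signature(block_a),
                           block_signature(block_b)).ratio()

a = ["mov eax, [esi+4]", "xor eax, 0x5c", "rol eax, 3", "add ebx, eax"]
b = ["mov ecx, [edi+8]", "xor ecx, 0x36", "rol ecx, 7", "add edx, ecx"]
c = ["push ebp", "mov ebp, esp", "call 0x401000", "ret"]
print(similarity(a, b), similarity(a, c))
```

Blocks `a` and `b` differ only in register allocation and constants, so they normalize to the same sequence, while `c` shares nothing with `a`.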

LLVM (2/2)
Modify TCG IR based on pattern matching of LLVM IR before execution
[Figure: x86 → PANDA as frontend (TCG IR → LLVM IR). The LLVM IR is pattern-matched against the fuzzy hash table (a red-black tree), and the result is fed back to the TCG IR. Diagram source: llvm.org]

Symbolic execution (1/2)
A technique for extracting path constraints by operating on symbolic variables
Source code:
if(x!=2015)
Trace log:
cmp eax, 0x7DF
je 0xdeadbaad
Counterexample:
Invalid.
ASSERT( INPUT_*_*_* =0hex7DF );
The value 2015 (0x7DF) affects the branch

Symbolic execution (2/2)
A technique for extracting path constraints by operating on symbolic variables
mov esi, 0x13
mov edx, 0x7DF
→ (esi == 0x13) and (edx == 0x7DF)
• Instructions must be in SSA (Static Single Assignment) form
• On x86, assignments to the same register may collide
mov esi, 0x13
…
mov esi, 0x7DF
→ (esi == 0x13) and (esi == 0x7DF), which is unsatisfiable
LLVM IR is already in SSA form, so it is suitable for symbolic execution
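The collision disappears once every write to a register gets its own version, which is what SSA form provides. A minimal sketch, assuming a list of (register, constant) assignments as input; the function name and the `_N` naming scheme are made up for the example.

```python
def to_ssa_constraints(assignments):
    """Give every write to a register a fresh version number."""
    version, constraints = {}, []
    for reg, value in assignments:
        version[reg] = version.get(reg, 0) + 1
        constraints.append(f"({reg}_{version[reg]} == {value:#x})")
    return " and ".join(constraints)

# Two writes to esi no longer contradict each other.
print(to_ssa_constraints([("esi", 0x13), ("esi", 0x7DF)]))
```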

Anti anti-analysis
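#include <windows.h>

/* Returns 1 if Sleep() looks patched or accelerated (a 500 ms sleep
   took 450 ms or less by GetTickCount()), 0 otherwise. */
static inline int IsSleepPatched()
{
    DWORD time1 = GetTickCount();
    Sleep(500);
    DWORD time2 = GetTickCount();
    if ((time2 - time1) > 450)
        return 0;
    else
        return 1;
}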
Use symbolic execution to get past anti-analysis code that matched a pattern
• E.g., avoid patch detection of Sleep()
• RDTSC, GetTickCount(), ……
• Which branch should be taken?
1. Take a snapshot
2. Rewrite the branch constraints
3. Take the branch that keeps running longer, or the one on which the expected number of clock cycles is spent (checked over 50 instructions)
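The branch-selection step can be sketched with a toy model. Everything here is hypothetical: the snapshot is a plain dictionary, the "program" is a Python generator that yields one step per instruction, and only the "runs longer" criterion is modeled, using the 50-instruction window from the slide.

```python
import copy

CHECK_WINDOW = 50  # the slide mentions checking 50 instructions

def run_from(snapshot, branch_taken, program):
    """Resume a copy of the snapshot with the branch outcome forced."""
    state = copy.deepcopy(snapshot)
    state["forced"] = branch_taken
    executed = 0
    for step in program(state):
        executed += 1
        if step == "exit" or executed >= CHECK_WINDOW:
            break
    return executed

def pick_branch(snapshot, program):
    """Try both outcomes and keep the one that stays alive longer."""
    lengths = {taken: run_from(snapshot, taken, program)
               for taken in (True, False)}
    return max(lengths, key=lengths.get), lengths

def toy_malware(state):
    """Hypothetical sample: exits at once if the timing check trips."""
    yield "cmp"
    if state["forced"]:
        yield "exit"
    else:
        while True:
            yield "work"

choice, lengths = pick_branch({"regs": {}}, toy_malware)
print(choice, lengths)
```

Because each attempt starts from a copy of the snapshot, trying the wrong branch first has no effect on the path that is finally kept.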

Extract decrypt-routines (4/5)
Malware
……
Decrypt-routine
Obfuscated code

Taint analysis (1/2)
A technique that analyzes dependencies between data by propagating tags
[Figure: a VMM hosting the guest OS; example instruction: mov eax, edx]

Taint analysis (2/2)
BBs that handle data received from the virtual NIC are likely to contain the decrypt-routines
• Taint source (origin of tags)
• Virtual NIC
• Taint sink (where tags are checked)
• End of BB
• Propagation rule
• References to registers and memory
r3 = Load(r2) ⇒ t_r3 = t_r2
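The propagation rule can be illustrated with a toy tracker. This is a sketch under assumptions, not PANDA's taint system: tags are modeled as sets attached to registers and memory addresses, the class and method names are invented, and the load rule is read as "the destination inherits the tags of the data that was read".

```python
class TaintTracker:
    """Tracks which registers and memory cells carry tags from a source."""

    def __init__(self):
        self.reg_tags, self.mem_tags = {}, {}

    def source(self, addr, tag):
        """Mark a memory cell as tainted, e.g. a byte written by the NIC."""
        self.mem_tags[addr] = {tag}

    def load(self, dst_reg, addr):
        """Register <- memory: the destination inherits the cell's tags."""
        self.reg_tags[dst_reg] = set(self.mem_tags.get(addr, set()))

    def mov(self, dst_reg, src_reg):
        """Register <- register: tags are copied."""
        self.reg_tags[dst_reg] = set(self.reg_tags.get(src_reg, set()))

    def binop(self, dst_reg, src_reg):
        """dst = dst op src: the result carries the union of both tag sets."""
        self.reg_tags[dst_reg] = (self.reg_tags.get(dst_reg, set())
                                  | self.reg_tags.get(src_reg, set()))

    def store(self, addr, src_reg):
        """Memory <- register: the cell inherits the register's tags."""
        self.mem_tags[addr] = set(self.reg_tags.get(src_reg, set()))

    def tainted_regs(self):
        """Sink check: registers that currently carry any tag."""
        return {r for r, tags in self.reg_tags.items() if tags}

t = TaintTracker()
t.source(0x1000, "nic")      # received packet byte
t.load("edx", 0x1000)        # edx now carries the "nic" tag
t.mov("eax", "edx")          # mov eax, edx propagates it
t.load("ecx", 0x2000)        # untainted memory, so ecx stays clean
print(sorted(t.tainted_regs()))
```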

Anti taint analysis
An obfuscation technique that interrupts the propagation of taint tags
• Under-tainting
• Data is not assigned directly
x = get_input();
if (x == "a")
{
    uri = "c2.php";
    msg = "a";
}
send(uri, msg);
x = get_input();
if (x > "a")
{
    tmp = x + "a";
    msg = tmp - x;
}
send(uri, msg);
But we have LLVM
• -early-cse, -constprop, -instcombine

Extract decrypt-routines (5/5)
Malware
……
Decrypt-routine

Now what?
BBs that handle data received from the virtual NIC are likely to contain the decrypt-routines
1. Execute malware
2. Avoid anti-analysis
3. Remove obfuscated code
4. Extract handler BBs of received data
5. Identify crypto-algorithms

Criteria for crypto-algorithm
Is a per-BB fuzzy hash useful for identifying crypto-algorithms?
• A per-BB comparison is not unique enough to serve as a signature
• There are many similar instructions, so there are many false positives
• Distinctive features do not stand out as they do for anti-analysis routines
• Compare, as a whole, all the code that references the received data
• Combine their fuzzy hashes and calculate the LCS
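The comparison of combined hashes can be sketched with a standard LCS computation. The slides do not say whether LCS means longest common subsequence or substring; this sketch assumes subsequence, and the normalization by the longer signature is an illustrative choice. The two sample strings are the ssdeep fragments shown earlier in the deck.

```python
def lcs_length(a, b):
    """Longest common subsequence length by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcs_score(sig_a, sig_b):
    """Normalize the LCS length to 0..1 by the longer signature."""
    longest = max(len(sig_a), len(sig_b))
    return lcs_length(sig_a, sig_b) / longest if longest else 1.0

print(lcs_score("Bb7g86hvE/", "GT7g86hvE/"))
```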

Experiments
Crypto-algorithm identification experiments using PANDEMONIUM
• Experiment A: Obfuscated sample program
• Experiment B: Real-world malware

Experiment A
Analysis of obfuscated sample program
Algorithm Obf A Obf B
MD5
DES
RC4
AES
Blowfish
RSA
A) Insert dead code / nop-equivalent instructions
B) Substitute with equivalent instructions / reorder instructions ≒ under-tainting
The sample program receives a packet and decrypts it (using Crypto++)

Experiment B (1/3)
Analysis of real-world malware
• Dyre sample
• 999bc5e16312db6abff5f6c9e54c546f
• b44634d90a9ff2ed8a9d0304c11bf612
• dd207384b31d118745ebc83203a4b04a
• B44634d90a9ff2ed8a9d0304c11bf612
• 999bc5e16312db6abff5f6c9e54c546f
• Anti-analysis using PEB.NumberOfProcessors

Experiment B (2/3)
• KINS(ZeusVM) sample
• eee1bdb8d4ad98cce0031ed6ca43274a
• 84826d5e65987c131a80b1a3aa53ce17
• a2a7d4f75fc263648824facb0757a3c7
• Obfuscated by a custom code virtualizer
• E.g., nop (0x90) is represented as 0x32, 0x26, 0xF3
• Use

Experiment B (3/3)
Malware | Detection ratio | Algorithm | Cause
Dyre | 4/5 | RSA |
KINS | 0/3 | RC4 | VM
• PANDEMONIUM was able to avoid Dyre's anti-analysis
• Taint tags might not have been propagated
• Optimization might have removed code that should have been analyzed
• LLVM is not suitable for analyzing modern code virtualizers
• Themida, ZeusVM, ……

Consideration
• Is LLVM suitable for analyzing malware?
• LLVM makes little use of carry flags
• If the implementation were improved, more features of the algorithms might appear
• Or does the detection rate vary depending on the type of encryption algorithm?
• It varies among implementations
• For now, this cannot be confirmed on criteria such as Feistel structure versus SPN structure
• PANDEMONIUM compared samples by concatenating the fuzzy hashes of BBs
• It may be necessary to give more weight to large blocks

Task
• Extract encryption keys
• Analyze unknown algorithms
• Should we focus on the density and data length of the input and output of functions?
• Analyze code virtualizers
• Should we implement an optimization pass?
We need an analysis platform that can keep up with the evolution of malware

Summary
• Malware uses many cryptographic algorithms
• To conceal messages and configurations
• Dynamic analysis on PANDA (QEMU)
• Translate x86 code to LLVM IR (intermediate representation) per BB (basic block: one entry, one exit)
• Remove obfuscated code by optimization
• Fuzzy-hash-based pattern matching
• Detect and avoid anti-dynamic-analysis code
• Identify cryptographic algorithms from the similarity in how received data is handled
