Malware uses many cryptographic algorithms.
To fight malware, analysts have to uncover the details of its activities.
Accordingly, it is important to identify the cryptographic algorithms that malware uses.
In this talk, I propose a faster, extensible method to automatically detect known cryptographic algorithms in malware using dynamic binary instrumentation and fuzzy hashing.
Automated Identification of Cryptographic Algorithms using Dynamic Binary Instrumentation and Fuzzy Hashing (PANDEMONIUM
1. PANDEMONIUM:
Automated Identification of Cryptographic Algorithms
using Dynamic Binary Instrumentation and Fuzzy Hashing
Yuma Kurogome
CODE BLUE 2015 [U-25]
2015.10.29
This material is partially based upon work supported by
Asian Office of Aerospace Research and Development,
U.S. Air Force Office of Scientific Research under Award No. FA2386-15-1-4068.
3. Abstract
• Malware uses many cryptographic algorithms
• To conceal messages and configurations
• DBI (Dynamic Binary Instrumentation)
• Dynamic analysis on PANDA (QEMU)
• Translate x86 code to LLVM IR (intermediate representation) per BB (basic block: one entry, one exit)
• Remove obfuscated code by optimization
• Fuzzy-hash-based pattern matching
• Detect and avoid anti-analysis code
• Identify cryptographic algorithms from the similarity in how received data is handled
4. Malware and crypto-algorithms
Malware uses many crypto algorithms
to conceal messages and configurations
• Banking trojan
• Decrypt configuration files
• Ransomware
• Encrypt victim files
This research deals with banking trojans
The server (C&C) holds the key
The key is hardcoded in the malware's own body
5. Evolution of banking trojan
New malware keeps emerging
from the black market
• Many variants were born from leaked Zeus code
• Citadel
• IceIX
• GameOver
• KINS
• New species have also been born
• Dyre
• Vawtrak
• Chthonic
http://www.wontok.com/wp-content/uploads/2014/10/wdt0185_MalwareTimeline_largeV2.jpg
6. Banking trojan and crypto-algorithms
Many banking trojans use encrypted
configuration files and commands
• Ex. Communication between Dyre and C&C
[Figure: Dyre C&C traffic, labeled "Key + IV" and "Encrypted data"]
We have to identify crypto algorithms promptly
7. Related work (1/2)
Identify crypto algorithms by focusing on
arithmetic/bitwise operations
• Dispatcher [CCS’09]
• Finds crypto routines from the ratio of arithmetic/bitwise insns between call and ret insns
• Cannot find crypto routines that are split across multiple subroutines
• ReFormat [ESORICS’09]
• Finds crypto routines from the peak in the overall execution log
• Cannot find them if multiple algorithms are implemented
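The ratio heuristic described for Dispatcher can be sketched as follows. This is a minimal illustration, not Dispatcher's implementation: the mnemonic set, the function names `arith_ratio` and `looks_like_crypto`, and the default values `min_len=20` and `threshold=0.55` are assumptions chosen for the example.

```python
# Mnemonics counted as arithmetic/bitwise (an illustrative, assumed set).
ARITH_BITWISE = {
    "add", "adc", "sub", "sbb", "mul", "imul", "and", "or", "xor",
    "not", "neg", "shl", "shr", "sar", "rol", "ror",
}


def arith_ratio(insns):
    """Return the fraction of insns whose mnemonic is arithmetic/bitwise."""
    if not insns:
        return 0.0
    hits = sum(1 for insn in insns if insn.split()[0].lower() in ARITH_BITWISE)
    return hits / len(insns)


def looks_like_crypto(insns, min_len=20, threshold=0.55):
    """Flag a call..ret body that is long enough and dominated by arithmetic."""
    return len(insns) >= min_len and arith_ratio(insns) > threshold
```

The limitation noted on the slide follows directly: if a cipher is split into several small subroutines, each body may fall below `min_len` or the threshold on its own.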
8. Related work (2/2)
Identify crypto algorithms by focusing on
loop structures
• Aligot [CCS’12]
• Extracts the inputs of loop structures and feeds them to implementations of known algorithms
• If the output is the same, the algorithm is the same
• The computational cost is high, O(n^2), and it can only identify known crypto algorithms
• Kerckhoffs [RAID’11]
• Extracts the inputs of loop structures and compares them with signatures of known algorithms
• If a pattern matches, the code is regarded as a crypto routine
• Can only identify known crypto algorithms
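The input/output comparison described for Aligot can be sketched like this: values observed at a loop are replayed through reference implementations, and an algorithm is reported when the outputs agree. The `REFERENCES` table and `identify` function are hypothetical names for this example, and the sketch assumes the key, input and output have already been extracted from the trace.

```python
import hashlib


def rc4(key: bytes, data: bytes) -> bytes:
    """Reference RC4: key scheduling followed by keystream XOR."""
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


# Known algorithms to test against (assumed table; MD5 ignores the key).
REFERENCES = {
    "RC4": lambda key, data: rc4(key, data),
    "MD5": lambda key, data: hashlib.md5(data).digest(),
}


def identify(key: bytes, data: bytes, observed_output: bytes):
    """Return the names of reference algorithms that reproduce the output."""
    return [name for name, impl in REFERENCES.items()
            if impl(key, data) == observed_output]
```

Only algorithms present in the table can ever be reported, which is the "known algorithms only" limitation listed above.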
9. Downside of related work
Method       Known algorithms   Unknown algorithms   Anti anti-analysis
Dispatcher                                           ☓
ReFormat                                             ☓
Aligot                          ☓                    ☓
Kerckhoffs                      ☓                    ☓
• Previous approaches assume that the execution log is infallible
• PANDEMONIUM can analyze malware even if it has anti-analysis routines and has been obfuscated
10. Anti-analysis
Much malware tries to detect debuggers
and sandboxes to avoid analysis
As a result, we often cannot obtain the expected analysis results
11. There is no silver bullet
Analysis platforms have not been able to keep up with
the complex techniques of malware
We need an extensible analysis platform
13. Emulation by QEMU
• TCG(Tiny Code Generator)
1. Disassemble the target code and create BBs (basic blocks) delimited by branch insns
2. Translate each BB to RISC-like TCG IR
3. Translate TCG IR to host code
4. Chain the translated BBs and execute them
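Step 1 above can be sketched as a pass that cuts a linear instruction stream after every branch insn. This is a simplified illustration rather than QEMU's code: the `BRANCH_MNEMONICS` set and the function name are assumptions, and the sketch only cuts after branches (it does not split at jump targets).

```python
# Mnemonics that end a block in this sketch (an assumed, incomplete set).
BRANCH_MNEMONICS = {
    "jmp", "je", "jne", "jz", "jnz", "ja", "jb", "jg", "jl", "call", "ret",
}


def split_basic_blocks(insns):
    """Split a linear instruction list into blocks that end at a branch insn."""
    blocks, current = [], []
    for insn in insns:
        current.append(insn)
        if insn.split()[0].lower() in BRANCH_MNEMONICS:
            blocks.append(current)
            current = []
    if current:  # trailing insns that are not followed by a branch
        blocks.append(current)
    return blocks
```

Each resulting block is the unit that would then be translated to TCG IR and chained with its successors.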
14. PANDA[REcon’14]
• DBI(Dynamic Binary Instrumentation)
1. Disassemble the target code and create BBs (basic blocks) delimited by branch insns
2. Translate each BB to RISC-like TCG IR
3. Translate TCG IR to LLVM IR
4. Translate TCG IR to host code
5. Chain the translated BBs and execute them
1. x86
push esp
push ebp
push ebx
2. TCG IR
movi_i64 tmp12,$0x8260a634
st_i64 tmp12,env,$0xdae0
ld_i64 tmp12,env,$0xdad0
3. LLVM IR
%2 = add i64 %env_v, 128
%3 = inttoptr i64 %2 to i64*
store i64 2187372084, i64* %3
• Callbacks before/after translation
• Can apply taint analysis and symbolic execution
We can obtain the LLVM IR that corresponds to the malware code
github.com/moyix/panda
15. Extract decryption routines (1/5)
Combine different approaches to identify
the decryption routines of malware
[Figure: malware running on the OS, made up of obfuscated code, an anti-analysis routine, a handler for received data, ……, the decryption routine, and obfuscated code]
17. Extract decryption routines (2/5)
Combine different approaches to identify
the decryption routines of malware
[Figure: malware made up of obfuscated code, an anti-analysis routine, a handler for received data, ……, the decryption routine, and obfuscated code]
19. Remove obfuscated code
LLVM optimization passes can remove
some obfuscated code
• Inserted dead/nop-equivalent insns
• -dse, -simplifycfg
• Substituted equivalent insns/reordered insns
• -constprop
• -instcombine
This also absorbs differences in insns caused by compiler implementations
(x = 14; y = x + 8) → (x = 14; y = 22)
(y = 3; ...; y = x + 1) → (...; y = x + 1)
(y = x + 2; z = y + 3) → (z = x + 5)
Cf. opticode.coseinc.com
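The first two rewrites above can be reproduced on a toy straight-line IR. The sketch below is not LLVM: it assumes every insn has the form `dest = lhs + rhs`, encoded as a tuple `(dest, lhs, rhs)` with operands that are either variable names or integers, and it does not implement the third, instcombine-style rewrite. All function names are hypothetical.

```python
def constant_propagate(prog):
    """Substitute known constants into operands and fold constant additions."""
    known = {}   # variable name -> constant value
    out = []
    for dest, lhs, rhs in prog:          # each insn means: dest = lhs + rhs
        lhs = known.get(lhs, lhs)
        rhs = known.get(rhs, rhs)
        if isinstance(lhs, int) and isinstance(rhs, int):
            known[dest] = lhs + rhs
            out.append((dest, lhs + rhs, 0))
        else:
            known.pop(dest, None)        # dest no longer holds a constant
            out.append((dest, lhs, rhs))
    return out


def dead_store_eliminate(prog, live_out):
    """Drop assignments that are overwritten or never read (backward scan)."""
    live = set(live_out)
    kept = []
    for dest, lhs, rhs in reversed(prog):
        if dest in live:
            live.discard(dest)
            live.update(op for op in (lhs, rhs) if isinstance(op, str))
            kept.append((dest, lhs, rhs))
    kept.reverse()
    return kept
```

For example, `constant_propagate([("x", 14, 0), ("y", "x", 8)])` turns `y = x + 8` into `y = 22`, and `dead_store_eliminate([("y", 3, 0), ("y", "x", 1)], live_out={"y"})` removes the overwritten `y = 3`.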
20. Extract decryption routines (3/5)
Combine different approaches to identify
the decryption routines of malware
[Figure: malware made up of an anti-analysis routine, a handler for received data, ……, the decryption routine, and obfuscated code]
22. Fuzzy hashing (1/2)
A technique for identifying data
that is partially different but similar
• ssdeep
• World leading security researchers will come together for this unique international conference in Tokyo
• Bb7g86hvE/
• W0rld leading security researchers will come together for this unique international conference in Tokyo
• GT7g86hvE/
Create signatures of some anti-analysis routines and crypto algorithms
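The reason a one-character change only alters part of the hash can be seen in a simplified context-triggered piecewise hash. This sketch is not ssdeep's actual algorithm: the window size, the default `block_size`, the FNV-style chunk hash and the similarity measure (`difflib` ratio) are all assumptions chosen to keep the example short.

```python
import difflib
import string

WINDOW = 7  # bytes covered by the rolling hash (assumed value)
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def piecewise_hash(data: bytes, block_size: int = 8) -> str:
    """Emit one character per chunk; chunk ends are chosen by a rolling hash."""
    sig = []
    chunk_hash = 0x811C9DC5          # FNV-1a offset basis
    chunk_len = 0
    rolling = 0                      # sum of the last WINDOW bytes
    for i, byte in enumerate(data):
        chunk_hash = ((chunk_hash ^ byte) * 0x01000193) & 0xFFFFFFFF
        chunk_len += 1
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]
        if rolling % block_size == block_size - 1:   # trigger point
            sig.append(ALPHABET[chunk_hash % 64])
            chunk_hash, chunk_len = 0x811C9DC5, 0
    if chunk_len:                    # leftover bytes after the last trigger
        sig.append(ALPHABET[chunk_hash % 64])
    return "".join(sig)


def similarity(a: bytes, b: bytes) -> float:
    """Score two inputs between 0.0 and 1.0 by comparing their signatures."""
    return difflib.SequenceMatcher(None, piecewise_hash(a), piecewise_hash(b)).ratio()
```

Because trigger points depend only on the last few bytes, a local edit changes the chunks around it while later chunk boundaries, and therefore later signature characters, stay the same.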
23. Fuzzy hashing (2/2)
A technique for identifying data
that is partially different but similar
• Create a fuzzy hash per BB
• Normalize operands
• Anti-analysis
• NtDelayExecution(), WaitForSingleObject(), GetCursorPos(), ……
• Crypto algorithms
• MD5, DES, RC4, ……
Create signatures of some anti-analysis routines and crypto algorithms
Signature sources: Beecrypt, Crypto++, OpenSSL
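Operand normalization before hashing might look like the sketch below. It is an assumption about one reasonable normalization, not the rules used in PANDEMONIUM: SSA value names become `%v` and integer immediates become `C`, so that two BBs that differ only in register numbering or constants produce the same text to hash. The function names are hypothetical.

```python
import re


def normalize(ir_line: str) -> str:
    """Replace value names and integer immediates in one LLVM IR line."""
    line = re.sub(r"%[\w.]+", "%v", ir_line)   # %2, %env_v -> %v
    line = re.sub(r"\b\d+\b", "C", line)       # 128 -> C (i64 is left alone)
    return line.strip()


def normalize_block(lines):
    """Normalize every line of a BB; the result is what would be hashed."""
    return [normalize(line) for line in lines]
```

With this, `%2 = add i64 %env_v, 128` becomes `%v = add i64 %v, C`, and the type name `i64` survives because the digits are not at a word boundary.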
24. LLVM (2/2)
Modify TCG IR based on pattern matching
of LLVM IR before execution
[Figure: the PANDA frontend translates x86 to TCG IR and LLVM IR; the LLVM IR is pattern-matched against the fuzzy hash table (red-black tree) and the result is fed back to the TCG IR. Logo: llvm.org]
25. Symbolic execution (1/2)
A technique for extracting path constraints
by operating on symbolic variables
Source code:
if(x!=2015)
Trace log:
cmp eax, 0x7DF
je 0xdeadbaad
Counterexample:
Invalid.
ASSERT( INPUT_*_*_* =0hex7DF );
The value 2015 (0x7DF) affects the branch
26. Symbolic execution (2/2)
A technique for extracting path constraints
by operating on symbolic variables
mov esi, 0x13
mov edx, 0x7DF
(esi == 0x13) and (edx == 0x7DF)
• Insns must be in SSA (Static Single Assignment) form
• On x86, assignments may collide
mov esi, 0x13
…
mov esi, 0x7DF
(esi == 0x13) and (esi == 0x7DF)
LLVM IR is suitable for symbolic execution
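The collision above disappears once every assignment gets a fresh version number, which is what SSA form provides. A minimal sketch, assuming the trace has already been reduced to `(register, immediate)` pairs from `mov reg, imm` insns; the function name and the `reg_N` naming scheme are hypothetical.

```python
from collections import defaultdict


def to_ssa_constraints(trace):
    """Build a constraint string in which each assignment has its own version."""
    version = defaultdict(int)       # register -> number of assignments so far
    constraints = []
    for reg, value in trace:
        version[reg] += 1
        constraints.append(f"({reg}_{version[reg]} == {value:#x})")
    return " and ".join(constraints)
```

Two writes to `esi` now yield constraints on `esi_1` and `esi_2`, which are satisfiable together, instead of the contradictory `(esi == 0x13) and (esi == 0x7DF)`.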
27. Anti anti-analysis
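#include <windows.h>

/* Returns 1 if Sleep() appears to be patched (shortened), 0 otherwise. */
static inline int IsSleepPatched()
{
    DWORD time1 = GetTickCount();
    Sleep(500);
    DWORD time2 = GetTickCount();
    /* An unpatched Sleep(500) lets more than 450 ms elapse. */
    if ((time2 - time1) > 450)
        return 0;
    else
        return 1;
}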
Use symbolic execution to avoid anti-analysis code
that matched a pattern
• Ex. Avoid patch detection of Sleep()
• RDTSC, GetTickCount(), ……
• Which branch to take?
1. Take a snapshot
2. Rewrite the branch constraints
3. Take the longer-lasting branch,
or the one on which the expected number of clocks is spent
(Check 50 insns)
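The "longer-lasting branch" rule can be illustrated on a toy control-flow graph. This is a sketch under strong assumptions: the program is a dict mapping a block name to `(insn_count, next_block)`, the snapshot is implicit because the toy has no mutable state, and `run_length` and `pick_branch` are hypothetical names. The 50-insn budget is taken from the slide.

```python
def run_length(cfg, start, budget=50):
    """Count insns executed from `start` until exit or until the budget is hit."""
    executed, block = 0, start
    while block is not None and executed < budget:
        size, block = cfg[block]
        executed += size
    return min(executed, budget)


def pick_branch(cfg, taken, fallthrough, budget=50):
    """Explore both successors from the same point and keep the longer-lived one."""
    if run_length(cfg, taken, budget) >= run_length(cfg, fallthrough, budget):
        return taken
    return fallthrough
```

A branch that leads straight to process exit after detecting the sandbox runs only a few insns, while the branch that reaches the payload keeps executing and exhausts the budget, so the latter is chosen.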
28. Extract decryption routines (4/5)
Combine different approaches to identify
the decryption routines of malware
[Figure: malware made up of a handler for received data, ……, the decryption routine, and obfuscated code]
29. Taint analysis (1/2)
A technique that analyzes dependencies
between data by propagating tags
[Figure: a guest OS on the VMM; the tag follows mov eax, edx]
30. Taint analysis (2/2)
The BBs that handle data received from the virtual
NIC should contain the decryption routines
• Taint source (origin of tags)
• Virtual NIC
• Taint sink (where tags are checked)
• End of BB
• Propagation rule
• References to registers and memory
r3 = Load(r2) → tr3 = tr2
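The propagation rule can be sketched with a shadow map from locations to tag sets. This is a simplified illustration, not PANDA's taint plugin: registers and memory cells are both treated as named locations, each trace entry is `(dst, [srcs])`, and the destination receives the union of its sources' tags. The function name and the tag `"nic"` are hypothetical.

```python
def propagate(trace, sources):
    """Apply t[dst] = union of t[src] over a trace; return the shadow map."""
    shadow = {loc: {tag} for loc, tag in sources.items()}
    for dst, srcs in trace:
        tags = set()
        for src in srcs:
            tags |= shadow.get(src, set())
        if tags:
            shadow[dst] = tags
        else:
            shadow.pop(dst, None)    # overwritten with untainted data
    return shadow
```

Checking the shadow map at the end of each BB (the taint sink) then tells which BBs touched data that originated from the virtual NIC.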
31. Anti taint analysis
An obfuscation technique that
interrupts the propagation of taint tags
• Under-tainting
• Data is not assigned directly
x = get_input();
if (x == "a")
{
    uri = "c2.php";
    msg = "a";
}
send(uri, msg);
x = get_input();
if (x > "a")
{
    tmp = x + "a";
    msg = tmp - x;
}
send(uri, msg);
But we have LLVM
-early-cse, -constprop, -instcombine
33. Now what?
The BBs that handle data received from the virtual
NIC should contain the decryption routines
1. Execute the malware
2. Avoid anti-analysis
3. Remove obfuscated code
4. Extract the BBs that handle received data
5. Identify crypto algorithms
34. Criteria for crypto-algorithm
Is a per-BB fuzzy hash useful for
identifying crypto algorithms?
• Per-BB comparison cannot maintain uniqueness as a signature
• There are many similar insns, hence many false positives
• Features do not stand out as they do for anti-analysis routines
• Compare all the code that references the received data
• Concatenate their fuzzy hashes and calculate the LCS
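The whole-handler comparison can be sketched as follows, assuming that LCS here means longest common subsequence and that the score is normalized by the longer signature. Both choices, and the names `lcs_length` and `signature_similarity`, are assumptions made for this example.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, 1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def signature_similarity(bb_hashes, known_hashes):
    """Concatenate per-BB fuzzy hashes and score them against a known signature."""
    a, b = "".join(bb_hashes), "".join(known_hashes)
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0
```

Scoring the concatenation rather than each BB separately means that a single BB that happens to resemble many others no longer decides the match on its own.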
36. Experiment A
Analysis of an obfuscated sample program
The program receives a packet and decrypts it (using Crypto++)
Algorithm Obf A Obf B
MD5
DES
RC4
AES
Blowfish
RSA
A) Insert dead/nop-equivalent insns
B) Substitute equivalent insns/reorder insns ≒ under-tainting
37. Experiment B (1/3)
Analysis of real-world malware
• Dyre sample
• 999bc5e16312db6abff5f6c9e54c546f
• b44634d90a9ff2ed8a9d0304c11bf612
• dd207384b31d118745ebc83203a4b04a
• B44634d90a9ff2ed8a9d0304c11bf612
• 999bc5e16312db6abff5f6c9e54c546f
• Anti-analysis using PEB.NumberOfProcessors
38. Experiment B (2/3)
Analysis of real-world malware
• KINS(ZeusVM) sample
• eee1bdb8d4ad98cce0031ed6ca43274a
• 84826d5e65987c131a80b1a3aa53ce17
• a2a7d4f75fc263648824facb0757a3c7
• Obfuscated by a custom code virtualizer
• Ex. nop (0x90) is represented as 0x32, 0x26, 0xF3
• Use
39. Experiment B (3/3)
Analysis of real-world malware
Malware   Detection ratio   Algorithm   Cause
Dyre      4/5               RSA
KINS      0/3               RC4         VM
• PANDEMONIUM could avoid the anti-analysis of Dyre
• The taint tag might not have been propagated
• The optimization might have removed the code that should have been analyzed
• LLVM is not suitable for analyzing modern code virtualizers
• Themida, ZeusVM, ……
40. Consideration
• Is LLVM suitable for analyzing malware?
• LLVM does not make much use of carry flags
• If the implementation were improved, more features of the algorithms might appear
• Or does the detection rate vary depending on the type of encryption algorithm?
• It varies among implementations
• For now, nothing can be concluded from criteria such as Feistel structure versus SPN structure
• PANDEMONIUM made its comparison by concatenating the fuzzy hashes of BBs
• It may be necessary to weight large blocks
41. Task
• Extract encryption keys
• Analyze unknown algorithms
• Should we focus on the density and the data length of the inputs and outputs of functions?
• Analyze code virtualizers
• Should we implement an optimization pass?
We need an analysis platform that can follow the evolution of malware
42. Summary
• Malware uses many cryptographic algorithms
• To conceal messages and configurations
• DBI (Dynamic Binary Instrumentation)
• Dynamic analysis on PANDA (QEMU)
• Translate x86 code to LLVM IR (intermediate representation) per BB (basic block: one entry, one exit)
• Remove obfuscated code by optimization
• Fuzzy-hash-based pattern matching
• Detect and avoid anti-dynamic-analysis code
• Identify cryptographic algorithms from the similarity in how received data is handled