"Techniques of binary code analysis for applications: requirements, problems, tools", Konstantin Panarin (Positive Technologies)
1. ptsecurity.ru
Techniques of binary code analysis –
methods, problems, tools
Konstantin Panarin,
developer in the low-level application analysis group
2. • Konstantin Panarin, Positive Technologies, kpanarin@ptsecurity.com
• Developer in the low-level application analysis group
#whoami
3. • Purposes of binary analysis
• Overview of techniques and arising problems
• Overview of modern analysis tools
AGENDA
4. • Error discovery
• Vulnerability discovery
• Searching for backdoors and undocumented features
• Recovery of program logic (reverse engineering)
• Test generation (more details later)
Purposes of binary analysis
5. Specifics of binary analysis
• Almost complete lack of type information
• Heavy use of obfuscation and anti-debugging techniques: CFG flattening, virtual machines, dead code insertion
• Difficulty of gathering “meta information” of various kinds (e.g. exception handlers)
• Complex semantics of particular assembly instructions (especially in the x86 instruction set): XLAT, DIV, CMPXCHG, and so on
6. Types of analysis:
• Static analysis
• No program execution
• Dynamic analysis
• One execution trace is analyzed per program run
• Combined analysis
Analysis techniques:
• Symbolic execution
• As a general rule, used in static analysis
• Analysis of marked data (taint analysis)
• As a general rule, used in dynamic analysis
• Fuzzing
• Expected input data is replaced by randomly generated bytes
• And many others
Methods of binary analysis
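The fuzzing bullet can be illustrated with a minimal C sketch: the expected input is replaced by randomly generated bytes and fed to a target. Everything here is an assumption made for illustration: `parse_record` is a hypothetical target with a planted bug, and the small xorshift generator is used only so that a run is reproducible from its seed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical target: returns -1 (a "crash") when the length field
 * claims more payload than the buffer actually holds. */
int parse_record(const uint8_t *buf, size_t len) {
    if (len < 2 || buf[0] != 0x7f)
        return 0;                 /* rejected early, nothing interesting */
    size_t claimed = buf[1];
    if (claimed > len - 2)
        return -1;                /* planted bug: an out-of-bounds read would happen here */
    return 1;                     /* parsed successfully */
}

/* xorshift32 pseudo-random generator; the seed must be non-zero. */
uint32_t next_random(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Feed `iterations` random 8-byte inputs to the target and
 * return how many of them triggered the "crash". */
unsigned fuzz(uint32_t seed, unsigned iterations) {
    uint8_t input[8];
    unsigned crashes = 0;
    for (unsigned i = 0; i < iterations; i++) {
        for (size_t j = 0; j < sizeof input; j++)
            input[j] = (uint8_t)(next_random(&seed) >> 24);
        if (parse_record(input, sizeof input) < 0)
            crashes++;
    }
    return crashes;
}
```

Purely random bytes rarely pass the first check (`buf[0] != 0x7f`), which is one reason fuzzing is usually combined with the other techniques listed above.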
7. Methods of binary analysis
In practice, analysis tools use a mixture of different techniques because each method has its own limitations. Applying several approaches in sequence helps to overcome these limitations partially, and sometimes completely.
8. Static vs dynamic analysis
Dynamic analysis:
• Run-time information is available: process memory map, addresses of indirect calls, and so on
• Program execution may require a specific environment
• It is not always possible to reproduce the results of a previous analysis run
Static analysis:
• Generally faster
• A single analysis run can potentially cover an unlimited number of execution paths
• Works even when some parts of the code/libraries are missing
• Unable to cope with obfuscation and encryption
9. • Key idea: replace concrete input data (e.g. function arguments) with symbolic values
• The analysis tool operates on symbolic expressions instead of their concrete counterparts
• Symbolic execution can, in principle, cover all execution paths in a single run
• Every execution path is represented by a program “state”, which holds all the constraints on the created symbolic variables (path and value constraints)
• SMT solver: a tool designed to solve constraints on symbolic variables
Techniques: symbolic execution
10. Symbolic execution: example
Let’s find values of x and y that force the execution flow to reach the label ERROR.
int twice(int v) {
    return 2 * v;
}
void test(int x, int y) {
    int z = twice(y);
    if (x == z) {
        if (x > y + 10)
            ERROR;          /* target location */
    }
}
int main() {
    int x = read();     /* read() supplies external input */
    int y = read();
    test(x, y);
}
Taken from http://www.srl.inf.ethz.ch/pa2015/Lecture8.pdf
11. Symbolic execution: example
Value constraints:
X->x0
Y->y0
Path constraints:
True
12. Symbolic execution: example
Value constraints:
X->x0
Y->y0
Z->2*y0
Path constraints:
True
13. Symbolic execution: example
The conditional branch if (x == z) creates two different states.
First state (branch taken):
Value constraints:
X->x0
Y->y0
Z->2*y0
Path constraints:
x0 = 2*y0
Second state (branch not taken):
Value constraints:
X->x0
Y->y0
Z->2*y0
Path constraints:
x0 != 2*y0
14. Symbolic execution: example
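The conditional branch if (x > y + 10) splits the state with x0 = 2*y0 once more.
First state (branch taken):
Value constraints:
X->x0
Y->y0
Z->2*y0
Path constraints:
x0 = 2*y0 ∧ x0 > y0 + 10
Second state (branch not taken):
Value constraints:
X->x0
Y->y0
Z->2*y0
Path constraints:
x0 = 2*y0 ∧ x0 <= y0 + 10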
15. Symbolic execution: example
Value constraints:
X->x0
Y->y0
Z->2*y0
Reachability condition of label ERROR (path constraints):
x0 = 2*y0 ∧ x0 > y0 + 10
16. Symbolic execution: example
Value constraints:
X->x0
Y->y0
Z->2*y0
Reachability condition of label ERROR (path constraints):
x0 = 2*y0 ∧ x0 > y0 + 10
The SMT solver gives the following solution:
x0 = 40, y0 = 20
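The reachability condition of the example can be checked with a small C sketch. An exhaustive search over a bounded range stands in for the SMT solver here; this is an assumption made to keep the example self-contained, and a real tool would pass the formula to a solver instead. The names `reaches_error` and `find_error_input` are hypothetical.

```c
#include <stdbool.h>

/* Path constraints of the state that reaches ERROR:
 * x0 == 2*y0  and  x0 > y0 + 10. */
bool reaches_error(int x0, int y0) {
    return x0 == 2 * y0 && x0 > y0 + 10;
}

/* Exhaustive search over [-bound, bound] for both inputs.
 * Returns true and stores the first witness found,
 * or returns false if no witness exists in that range. */
bool find_error_input(int bound, int *x_out, int *y_out) {
    for (int y0 = -bound; y0 <= bound; y0++) {
        for (int x0 = -bound; x0 <= bound; x0++) {
            if (reaches_error(x0, y0)) {
                *x_out = x0;
                *y_out = y0;
                return true;
            }
        }
    }
    return false;
}
```

The pair x0 = 40, y0 = 20 from the slide satisfies `reaches_error`. A solver may return any satisfying assignment, so a different witness (this search finds x0 = 22, y0 = 11 first) is equally valid.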
17. • Symbolic execution was originally used for test generation (in the 1970s):
• Input values of the tested program are marked as symbolic
• If an error/vulnerability is found at some point in the executable, we solve the reachability constraints for this point
• The resulting conditions on the input data form the required test
• This scheme is widely used in various verification systems
Symbolic execution: applications
18. Symbolic execution: general scheme
• Translator into IR: translates each assembler instruction into a set of IR instructions
X86: mov eax, ecx
___________________
IR:
STR R_ECX:32, , V_00:32
STR V_00:32, , R_EAX:32
• Interpreter: contains handlers for each IR instruction
• Pool of states (one state per execution path): State №1, State №2, State №…, State №500
• Each state holds the following data:
• Current IP (instruction pointer)
• Symbolic context (registers, memory cells)
• Constraints
• Searcher: selects a state from the pool
• Executor (director): processes a particular state
• Conditional branch with condition X: add new states into the pool
• New state a: Constraints += X
• New state b: Constraints += ~X
• If some “interesting” point is reached, check its reachability: extract the path constraints from the state and solve the corresponding SMT task
• SMT solvers: Z3, STP, Boolector
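The pool-and-fork part of the scheme can be sketched in C as follows. This is a simplified illustration under stated assumptions, not the design of any particular tool: constraints are reduced to a condition number plus a polarity, the symbolic context (registers, memory) is omitted, and the searcher is a plain depth-first stack. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CONSTRAINTS 16
#define MAX_STATES 64

/* One path constraint: condition number `cond_id` must be true or false. */
typedef struct {
    int cond_id;
    int must_hold;   /* 1 means X, 0 means ~X */
} Constraint;

/* One state per execution path (symbolic registers/memory omitted). */
typedef struct {
    unsigned ip;                              /* current instruction pointer */
    Constraint constraints[MAX_CONSTRAINTS];
    size_t n_constraints;
} State;

typedef struct {
    State states[MAX_STATES];
    size_t count;
} Pool;

void pool_push(Pool *pool, const State *s) {
    assert(pool->count < MAX_STATES);
    pool->states[pool->count++] = *s;
}

/* Searcher: here simply depth-first (take the most recently added state). */
State pool_pop(Pool *pool) {
    assert(pool->count > 0);
    return pool->states[--pool->count];
}

/* Conditional branch on condition `cond_id`: fork into state a
 * (X holds, jump taken) and state b (~X holds, fall through),
 * then add both new states to the pool. */
void fork_on_branch(Pool *pool, const State *s, int cond_id,
                    unsigned taken_ip, unsigned fallthrough_ip) {
    assert(s->n_constraints < MAX_CONSTRAINTS);
    State a = *s, b = *s;
    a.constraints[a.n_constraints++] = (Constraint){cond_id, 1};
    a.ip = taken_ip;
    b.constraints[b.n_constraints++] = (Constraint){cond_id, 0};
    b.ip = fallthrough_ip;
    pool_push(pool, &a);
    pool_push(pool, &b);
}
```

Each fork copies the parent's constraint list, so the number of stored states grows with every symbolic branch.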
19. Symbolic execution: existing problems
• Path explosion (how to generate fewer states?)
• Loop unrolling (how to process loops whose exit condition depends on a symbolic variable?)
• Symbolic pointers (how to handle load and store operations whose addresses are themselves symbolic?)
• Constraint complexity (some generated constraints are too hard for any SMT solver to solve exactly)
• External resources (how to process file handles and other references to external objects?)
20. Symbolic execution: possible solutions
• Path explosion: merge several states into a bigger one (but when and how?)
• Path explosion: process several states simultaneously (parallel symbolic execution)
• Loop unrolling, symbolic pointers: use specific SMT logics (how effective is it?)
• External resources: use a DSL to describe external calls in terms of the solver's expressions
21. • A purely dynamic analysis method
• Links the execution trace with the data the program processes during execution
• Answers the question of how the program processes particular pieces of input data
Analysis of marked data (taint analysis)
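22. Taint analysis: basic idea
• Main concepts: shadow memory and taint propagation.
• Shadow memory: a structure that stores the taint status of every register and memory cell
• Taint propagation: rules that transfer taint from the source operands of an instruction to its destination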
23. Taint propagation: examples
mov eax, tainted_input
xor eax, eax ; eax is UNTAINTED
-----------------------------------------
push tainted_input
pop eax ; eax is TAINTED
-----------------------------------------
xor eax, eax
cmp eax, tainted_input ; AF, CF, OF, PF, SF, ZF are TAINTED
-----------------------------------------
mov eax, tainted_input
mov ecx, untainted_input
add ecx, eax ; ecx is TAINTED
-----------------------------------------
mov eax, tainted_input
mov ecx, untainted_input
mov ax, cx ; ax is UNTAINTED, eax is TAINTED
-----------------------------------------
Taken from http://defcon.org.ua/data/1/4_Oleksyk_Code_Analysis.pdf
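Some of the rules above can be sketched in C with byte-granular register taint. This is a minimal illustration, assuming one taint bit per register byte and only two registers; the rule for `add` is a deliberately conservative choice, and all names are hypothetical.

```c
#include <stdint.h>

enum { EAX, ECX, NUM_REGS };

/* Shadow state: one taint bit per register byte (bit 0 = lowest byte). */
typedef struct {
    uint8_t reg_taint[NUM_REGS];
} TaintContext;

/* mov r32, src: the destination takes over the taint of the source. */
void taint_mov32(TaintContext *t, int dst, uint8_t src_taint) {
    t->reg_taint[dst] = src_taint & 0x0f;
}

/* mov r16, r16: only the two low bytes of the destination are overwritten,
 * the two high bytes keep their previous taint. */
void taint_mov16(TaintContext *t, int dst, int src) {
    t->reg_taint[dst] = (uint8_t)((t->reg_taint[dst] & 0x0c) |
                                  (t->reg_taint[src] & 0x03));
}

/* add r32, r32: conservative rule, any tainted input byte taints the
 * whole result because carries can cross byte boundaries. */
void taint_add32(TaintContext *t, int dst, int src) {
    if (t->reg_taint[dst] | t->reg_taint[src])
        t->reg_taint[dst] = 0x0f;
}

/* xor r32, r32: xor of a register with itself always yields 0 and clears
 * the taint; otherwise taint is merged byte by byte. */
void taint_xor32(TaintContext *t, int dst, int src) {
    if (dst == src)
        t->reg_taint[dst] = 0;
    else
        t->reg_taint[dst] |= t->reg_taint[src];
}
```

With this representation, "ax is UNTAINTED, eax is TAINTED" corresponds to the mask 0x0c: the two low bytes are clean while the two high bytes still carry taint.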
24. Taint analysis: scheme
Program code:
___________________
push ebp
mov ebp, esp
lea eax, [esp+8]
…
ret
• Runtime analysis of machine instructions: add eax, [esp+8]
• Instruction handler: syntax parsing, extraction of the instruction’s operands, address resolution (for memory operands)
• Operands: dest: eax; src: eax, 0x7f2300
• Context reading from shadow memory: eax is not tainted, 0x7f2300 is tainted
• Taint propagation
• Context writing to shadow memory: eax is tainted
• Taint context (shadow memory): EAX: not tainted, EBX: not tainted, ECX: tainted, …, EDI: tainted
25. Taint analysis
Usage of taint analysis:
• A tainted EIP suggests the possibility of control flow hijacking (for example, as a result of a stack/heap overflow).
• Tainted arguments of particular functions (the printf family of functions, system) suggest the possibility of a vulnerability.
• Tainted resources (handles, mutexes) that should not depend directly on user input suggest the possibility of a logic error in the program.
Drawbacks:
• Requires detailed analysis of every assembler instruction, which may be tedious for some architectures (x86)
• An ideal taint analysis should instrument all code executed by the operating system (both in user mode and in kernel mode), which is not always possible.
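The "tainted EIP" check can be sketched in C over a small shadow stack. This is an illustrative model only: the stack layout, the 256-byte region, and the function names are assumptions, and a real tool would track taint for the whole address space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define STACK_SIZE 256

/* Shadow memory for a small stack region: one taint flag per byte. */
typedef struct {
    bool tainted[STACK_SIZE];
} ShadowStack;

/* Model of copying `len` bytes of user input to stack offset `dst`:
 * every written byte becomes tainted. */
void copy_user_input(ShadowStack *s, size_t dst, size_t len) {
    assert(dst <= STACK_SIZE && len <= STACK_SIZE - dst);
    for (size_t i = 0; i < len; i++)
        s->tainted[dst + i] = true;
}

/* `ret` loads EIP from the 4 bytes at the stack pointer; if any of them
 * is tainted, user input controls where execution continues. */
bool ret_loads_tainted_eip(const ShadowStack *s, size_t esp) {
    assert(esp <= STACK_SIZE - 4);
    for (size_t i = 0; i < 4; i++)
        if (s->tainted[esp + i])
            return true;
    return false;
}
```

For example, with a 32-byte buffer at offset 16, a saved EBP at offset 48, and the return address at offset 52, copying 32 bytes leaves the return address clean, while copying 40 bytes taints it.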
26. Combined analysis, aka concolic execution
Concrete + symbolic = concolic:
• Create snapshots of the entire process at chosen control points
• Instrument the concrete execution trace and, at the same time, fill a queue of symbolic constraints: for each conditional branch on the trace, push its constraint into the queue
• Roll back to the previous control point, select a new symbolic condition from the queue, solve the corresponding SMT task, and substitute the solution found into the concrete execution context (registers and memory areas)
• Instrument the new trace with the new parameters
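The loop described above can be sketched in C on the earlier test(x, y) example. Several simplifications are assumed: there are no process snapshots (each iteration re-runs from the start), only the last branch on the trace is negated instead of keeping a queue, and a bounded exhaustive search stands in for the SMT solver. All names are hypothetical.

```c
#include <stdbool.h>

#define MAX_BRANCHES 2

/* Outcomes of the conditional branches met on one concrete run. */
typedef struct {
    int n;                       /* number of branches executed */
    bool taken[MAX_BRANCHES];    /* outcome of each branch, in order */
    bool error;                  /* true if ERROR was reached */
} Trace;

/* Instrumented concrete execution of the example's test() function. */
Trace run_concrete(int x, int y) {
    Trace t = {0, {false, false}, false};
    int z = 2 * y;                        /* z = twice(y) */
    t.taken[t.n++] = (x == z);
    if (x == z) {
        t.taken[t.n++] = (x > y + 10);
        if (x > y + 10)
            t.error = true;
    }
    return t;
}

/* Stand-in for the SMT solver: search [-bound, bound] for inputs whose
 * trace starts with the `k` wanted branch outcomes. */
bool solve_prefix(const bool *want, int k, int bound, int *x, int *y) {
    for (int y0 = -bound; y0 <= bound; y0++) {
        for (int x0 = -bound; x0 <= bound; x0++) {
            Trace t = run_concrete(x0, y0);
            bool match = t.n >= k;
            for (int i = 0; match && i < k; i++)
                match = (t.taken[i] == want[i]);
            if (match) {
                *x = x0;
                *y = y0;
                return true;
            }
        }
    }
    return false;
}

/* Concolic loop: run concretely, negate the last branch, solve, re-run.
 * Returns the number of concrete runs needed to reach ERROR, 0 on failure. */
int concolic_search(int *x, int *y, int bound, int max_runs) {
    for (int run = 1; run <= max_runs; run++) {
        Trace t = run_concrete(*x, *y);
        if (t.error)
            return run;
        bool want[MAX_BRANCHES];
        for (int i = 0; i < t.n; i++)
            want[i] = t.taken[i];
        want[t.n - 1] = !want[t.n - 1];   /* negate the last branch condition */
        if (!solve_prefix(want, t.n, bound, x, y))
            return 0;
    }
    return 0;
}
```

Starting from x = 1, y = 0, the first run misses the outer branch, the second run enters it, and the third run reaches ERROR. The `max_runs` limit guards against programs where negating only the last branch would not terminate.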
27. Existing tools (open source)
KLEE
• Created as a tool for generating high-coverage tests
• Based on LLVM IR
• Uses symbolic execution
• Generates tests automatically
28. Existing tools (open source)
Triton
• Uses concolic execution
• Converts asm instructions directly into the solver’s (Z3) expressions (bypassing an internal representation)
Other: FuzzBall, BitBlaze
• None of the tools is of production quality
• Every tool is focused on solving one particular task
29. Existing tools (closed source)
MAYHEM
• Designed to automatically search for vulnerabilities and generate exploits
• Able to work with symbolic pointers
• Winner of the DARPA Cyber Grand Challenge in 2016
CodeSurfer, VeraCode
• Little is known about their internal structure
30. Conclusion
• Methods of binary analysis still require a lot of careful research
• At present, no universal binary analysis tool exists
• Every tool is focused on solving one particular task
• Positive Technologies is working on its own tool – STAY TUNED!