Ruxmon.2013-08.-.CodeBro!

Improving static code review using AST-based code analysis
Christophe Alladoum
@_hugsy_
hugsy

Who am I ?
➔ Christophe Alladoum
➔ IOActive pirate
➔ blah blah blah

What is this about ?
➔ I read a LOT of code
◆ mostly for fun (sometimes for work)
● just to know how it works
● occasionally to find bugs
◆ most of the time, C code
● sometimes C++
● occasionally higher level stuff: PHP (lol), Java,
Python, ...

What is this about ?
➔ C code is tricky & not trivial
● many standards (ANSI C - C89, C99, C11, etc..)
● many bad coding practices
● MANY subtleties in the language
➔ Ergo, many places for flaws
● logic errors
● programming errors
● lack of restriction in code (buffers, integers)
I like

Existing automated tools
● Many open-source & commercial ($$$) tools use regular expressions to
find weak patterns
● Insufficient approach :
○ Example using the latest flawfinder :
○ Basically as clever as running `grep`
(which is one of the best vuln finders, btw)
Ok, thanks !
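
To make the limitation concrete, here is a minimal sketch of a grep-style scanner. It is not flawfinder's actual implementation: the pattern list and the C snippet are made up for illustration. It flags a comment and a string literal just as happily as the one real call.

```python
import re

# Hypothetical list of "dangerous" functions, in the spirit of grep-style scanners.
RISKY_CALL = re.compile(r"\b(strcpy|strcat|sprintf|gets)\s*\(")

# Made-up C source: only line 4 contains a real call to strcpy().
C_SOURCE = '''\
/* TODO: never use strcpy(dst, src) here */
void copy(char *dst, const char *src) {
    puts("strcpy() is banned");
    strcpy(dst, src);
}
'''

def regex_scan(source):
    """Return (line_number, function_name) for every textual match."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in RISKY_CALL.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits

hits = regex_scan(C_SOURCE)
for lineno, name in hits:
    print("line {}: possible unsafe call to {}()".format(lineno, name))
```

Three hits are reported (lines 1, 3 and 4), but two of them are a comment and a string. A regex has no notion of what is code and what is not; a parser does.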

Existing automated tools
○ and (too) many times, there are “strange” results
○ Usually a very *bad* idea to just paste output from
those tools in a (serious) code review report
*PLUS* splint fails to
see vulnerable calls

A smarter approach
➔ C-based projects are ultimately made
to be compiled & linked
◆ Compilers are the best code reviewers !!
● Code is parsed and transformed into another format
● Code is validated
● Some additional checks are even provided by default for
programming errors (type checks, unused vars, invalid
formatted strings, uninitialized values, etc…)

Quick reminder on compilers
● Compiler, noun : set of programs that transforms source code written in a
programming language into another computer language (Wikipedia).
■ Examples : GCC, as, Python (which embeds a bytecode compiler), etc...
● Abstract representation of compiler behavior:

LLVM Specifics
● What makes LLVM so special ?
○ LLVM (Low-Level Virtual Machine) : 13 year old project
○ Many different projects around this architecture
○ LLVM structure *truly* isolates each part
(lexing/optimizing/generating)
○ Totally Plug-and-Play
● you can easily write a lexer for generating Python .pyc files ...
● … or you can use the optimizer API to help runtime bug detection (heard of Google's
AddressSanitizer module ?) …
● … or you can use an existing parser (for instance GCC’s) and bind it to the rest
of the LLVM architecture (llvm-gcc)
→ really cool features ! Go
hack it !!

LLVM Specifics
● Clang
○ Default C/C++/Obj-C compiler front end for the LLVM architecture
○ Parser gets .c, .cpp, .m files as input and generates an
Intermediate Representation (IR) of the code
→ this is achieved thanks to an Abstract Syntax Tree (AST)
created when “reading” each source file
○ An API is provided to interact with the generated AST
→ in native C++
→ or higher-level languages, like Python
■ This means that Clang parses the code for us, then why not use
this to parse code in a smart way (and ultimately find
vulnerabilities) ?

Clang Python API
● Relatively easy to use...
○ … but not documented thoroughly enough (just automatically generated documentation)
→ pydoc works fairly well on it
○ Many blog posts on the topic (but sometimes outdated)
○ Namespace is fairly intuitive
Basic example : outputs
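
A minimal sketch of such a basic example, assuming the `clang` Python bindings and a matching libclang are installed. The helper names `dump_ast` and `dump_file` are made up here; only `Index.create()`, `parse()`, `cursor.kind`, `cursor.displayname` and `get_children()` come from `clang.cindex`. The clang import is done lazily, so the tree walker itself can be tried on any object exposing the same attributes.

```python
def dump_ast(cursor, depth=0):
    """Yield one indented line per AST node, depth-first."""
    yield "{}{} {}".format("  " * depth, cursor.kind.name, cursor.displayname)
    for child in cursor.get_children():
        yield from dump_ast(child, depth + 1)

def dump_file(path, args=()):
    """Parse `path` with libclang and print its AST (requires the clang bindings)."""
    import clang.cindex  # lazy import: dump_ast() stays usable without libclang

    index = clang.cindex.Index.create()
    tu = index.parse(path, args=list(args))  # e.g. args=["-I", "include"]
    for line in dump_ast(tu.cursor):
        print(line)
```

Calling `dump_file("test.c")` (a hypothetical file name) should print one line per node, starting with the `TRANSLATION_UNIT` node and indenting two spaces per nesting level.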

Demo
● clang-draw-ast.py is a 70-line Python script that will parse a C source
file and display (PNG format) the corresponding AST.

(This is the expected result if live demo fails)

The magic inside
The indexing engine API is exposed by the `clang.cindex` package.
● Index
○ top-level object which manages some global library state.
● TranslationUnit
○ High-level object encapsulating the AST for a single translation unit
(parsed on the fly)
● SourceRange, SourceLocation, and File
○ Objects representing information about the input source.

Clang internals voodoo
The routines in this group provide the
ability to create and destroy translation
units from files, either by parsing the
contents of the files or by reading in a
serialized representation of a
translation unit.
● Once the indexing engine is created, the parse() function
returns a TranslationUnit object
○ The most important object
● A Cursor object is used to iterate through all nodes
○ kind : the kind (type) of the current node
○ displayname : display name for the entity referenced
○ location : the source location (the starting
character)
○ get_children() : return an iterator for accessing the children of
this cursor
○ get_arguments(): return an iterator for accessing the arguments
of this cursor
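
As a sketch of how these Cursor attributes combine, the hypothetical helper below walks the tree and reports every call to a function of interest, with its location and argument count. It uses `cursor.spelling` (an attribute not listed above, which holds the bare function name for a call expression) and compares `kind.name` as a string so it does not need to import `CursorKind`.

```python
def find_calls(cursor, names):
    """Yield (name, file, line, column, argc) for each call to a function in `names`."""
    if cursor.kind.name == "CALL_EXPR" and cursor.spelling in names:
        loc = cursor.location
        # loc.file can be None for nodes that have no physical source file
        filename = loc.file.name if loc.file else "<unknown>"
        argc = len(list(cursor.get_arguments()))
        yield (cursor.spelling, filename, loc.line, loc.column, argc)
    for child in cursor.get_children():
        yield from find_calls(child, names)
```

With a real translation unit `tu`, something like `list(find_calls(tu.cursor, {"strcpy", "sprintf"}))` would list the candidate calls. Unlike the regex approach, comments and string literals never show up because they are not `CALL_EXPR` nodes.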

Clang internals voodoo
Now we can better understand the previous script
Easy, right ?

Pros / Cons
Pros
● simple and intuitive Python bindings
● full control over all the code being audited
● parsing and browsing are fast
● can be extended with LLVM extra modules
Cons
● bindings are built on Python ctypes : might not work as well for other
high-level languages (Ruby, Java, etc.)
Limitations ?
● Many developments, API keeps on improving and docs becoming more
complete

Introducing CodeBro!
● Built as a Proof-of-Concept around this idea
○ Meaning : you can use it but don’t rely on it
● Underlying idea : create a web-based tool that would interface between
AST and code reviewer
○ Code reviewer can smartly analyse/navigate through code and
eventually add some modules to detect basic (or advanced)
vulnerabilities

CodeBro!
● 100% Open-Source
○ Beer-Ware License
● 100% Python
● (Hopefully) Easily installable (pip)
● Django-based application (compat. 1.5+)
○ combines many cool Python based technologies
■ PyDot
■ PyCharm
■ Pygments
■ etc.
○ Keeps things simple
■ 1 project to audit = 1 specific database (default : SQLite)

CodeBro!
● Uses Clang parsing module to dynamically
interact with code
○ Cross-referencing feature similar to IDA Pro
■ only between functions (caller/callee)
○ Call graph generation : visual understanding of code
■ graphs are generated as SVG → can be browsed in a web browser
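
A minimal sketch of the cross-reference and call graph idea, not CodeBro's actual code. It assumes the (caller, callee) pairs have already been extracted from the AST, and emits Graphviz DOT text. The `/function/` URL prefix is a made-up route; Graphviz turns `URL` attributes into clickable links when rendering to SVG.

```python
from collections import defaultdict

def build_xrefs(calls):
    """Return (callees_of, callers_of) maps built from (caller, callee) pairs."""
    callees_of, callers_of = defaultdict(set), defaultdict(set)
    for caller, callee in calls:
        callees_of[caller].add(callee)
        callers_of[callee].add(caller)
    return callees_of, callers_of

def to_dot(calls, url_prefix="/function/"):
    """Render the call graph as Graphviz DOT text with one link per node."""
    nodes = sorted({name for pair in calls for name in pair})
    lines = ["digraph callgraph {"]
    for name in nodes:
        lines.append('    "{0}" [URL="{1}{0}"];'.format(name, url_prefix))
    for caller, callee in sorted(set(calls)):
        lines.append('    "{}" -> "{}";'.format(caller, callee))
    lines.append("}")
    return "\n".join(lines)

# Made-up call pairs, as they could come out of an AST walk.
CALLS = [("main", "parse_args"), ("main", "run"), ("run", "strcpy")]
callees_of, callers_of = build_xrefs(CALLS)
print(sorted(callers_of["strcpy"]))
print(to_dot(CALLS))
```

Saving the DOT text to a file and running `dot -Tsvg callgraph.dot -o callgraph.svg` produces a graph that can be opened directly in a browser.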

CodeBro!
● “Analysis” module
○ reports all default diagnostics provided by Clang
○ provides a “Plugin” API
■ some modules implemented
■ … some more to come
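
Reporting Clang's default diagnostics could look like the sketch below. This is an assumption about the approach, not CodeBro's code: `report` and `diagnose_file` are hypothetical names, while `tu.diagnostics` and the `severity`, `spelling` and `location` attributes come from `clang.cindex`.

```python
# Severity levels as numbered by libclang (Diagnostic.Ignored .. Diagnostic.Fatal).
SEVERITY = {0: "ignored", 1: "note", 2: "warning", 3: "error", 4: "fatal"}

def report(diagnostics, min_severity=2):
    """Format diagnostics at or above `min_severity` as file:line:col lines."""
    out = []
    for diag in diagnostics:
        if diag.severity < min_severity:
            continue
        loc = diag.location
        filename = loc.file.name if loc.file else "<unknown>"
        out.append("{}:{}:{}: {}: {}".format(
            filename, loc.line, loc.column,
            SEVERITY.get(diag.severity, "unknown"), diag.spelling))
    return out

def diagnose_file(path, args=("-Wall",)):
    """Parse `path` with libclang and return its formatted diagnostics."""
    import clang.cindex  # lazy import: report() stays usable without libclang

    tu = clang.cindex.Index.create().parse(path, args=list(args))
    return report(tu.diagnostics)
```

Passing extra warning flags through `args` (for example `-Wformat`) is one way to get more out of the compiler's built-in checks.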

CodeBro!
● Extensible through plugins
○ can use AST and/or already existing references
○ Examples :
■ detecting dead code
● find all functions that are never called (i.e. no incoming Xref)
■ improving format string flaws detection
● “count” number of args for known functions (printf, sprintf,
etc.) and parse the arguments
● detect functions wrapping format string functions (based on earlier
calls)
■ (to a limited extent)
detect use-after-free like this →
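
The first two plugin ideas (dead code and format string argument counting) can be sketched without Clang at all, assuming the function list, the (caller, callee) pairs and the format string literals were already pulled out of the AST. All names below are hypothetical, treating `main` as an entry point is an assumption, and the format regex covers common printf conversions only.

```python
import re

def never_called(functions, calls, entry_points=("main",)):
    """Return functions with no incoming Xref, ignoring known entry points."""
    called = {callee for _caller, callee in calls}
    return sorted(set(functions) - called - set(entry_points))

# One printf conversion: flags, width, precision, length modifier, conversion.
_SPEC = re.compile(
    r"%(?:%|[-+ #0]*(\*|\d+)?(?:\.(\*|\d*))?(?:hh|h|ll|l|j|z|t|L)?"
    r"[diouxXeEfFgGaAcspn])")

def expected_args(fmt):
    """Count the arguments a printf-style format string consumes."""
    count = 0
    for match in _SPEC.finditer(fmt):
        if match.group(0) == "%%":
            continue  # literal percent sign, consumes nothing
        # a '*' width or precision takes one extra int argument each
        count += 1 + [match.group(1), match.group(2)].count("*")
    return count

def missing_args(fmt, supplied):
    """Return how many arguments are missing (negative means too many)."""
    return expected_args(fmt) - supplied

print(never_called(["main", "parse", "helper", "old_api"],
                   [("main", "parse"), ("parse", "helper")]))
print(missing_args("%s: %d%%\n", 1))
```

A plugin would feed `missing_args` the first argument of each `printf`-like call together with the number of remaining arguments, and flag anything non-zero. A non-literal format string is worth flagging on its own.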

Demo time
(More screenshots if demo still fails)

Code browsing - unparsed
then parsed

Call graph generation : SVG generation (href linking)
← Functions listing

Future enhancements
● Still a work in progress
● Fix bugs
● Index all components of source files (instead of just CALL_EXPR and
FUNCTION_DECL)
● Improve search engine
● Add macro parsing
● Support more source code input methods (Git, as soon as there is a decent
Python Git bindings package)
● Improve C++ and Objective-C analysis
● Add moar modulez !!

Links :
● https://github.com/hugsy/codebro
● https://twitter.com/_hugsy_
● http://eli.thegreenplace.net/2011/07/03/parsing-c-in-python-with-clang
● http://llvm.org/devmtg/2010-11/Gregor-libclang.pdf
● https://code.google.com/p/address-sanitizer/wiki/AddressSanitizer
