2. Who am I ?
➔ Christophe Alladoum
➔ IOActive pirate
➔ blah blah blah
3. What about ?
➔ I read a LOT of code
◆ mostly for fun (eventually for work)
● just to know how it works
● occasionally to find bugs
◆ most of the time, C code
● sometimes C++
● occasionally higher-level stuff: PHP (lol), Java, Python, ...
4. What about ?
➔ C code is tricky & not trivial
● many standards (ANSI C - C89, C99, C11, etc.)
● many bad coding practices
● MANY subtleties in the language
➔ Ergo, many places for flaws
● logic errors
● programming errors
● lack of restriction in code (buffers, integers)
5. Existing automated tools
● Many open-source & licensed ($$$) tools use regexps to
find weak patterns
● Insufficient approach :
○ Example using the latest flawfinder :
○ Basically as clever as running `grep`
(which is one of the best vuln finders, btw)
Ok, thanks !
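To make the `grep` comparison concrete, here is a minimal sketch of regexp-based scanning. The watch list and the `grep_risky_calls` helper are made up for illustration and are not flawfinder's actual rules. The sample contains no dangerous call at all, yet purely textual matching still flags the comment and the string literal.

```python
import re

# Hypothetical watch list; real tools ship much longer rule sets.
RISKY = ("strcpy", "strcat", "sprintf", "gets", "system")
PATTERN = re.compile(r"\b(" + "|".join(RISKY) + r")\s*\(")


def grep_risky_calls(source):
    """Return (line_number, function_name) for every textual match."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in PATTERN.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits


# No risky function is actually called in this snippet.
SAMPLE = '''\
void f(char *dst, const char *src) {
    /* do not use strcpy(dst, src) here */
    strncpy(dst, src, 15);
    puts("never call gets() in new code");
}
'''

for lineno, name in grep_risky_calls(SAMPLE):
    print("line %d: possible call to %s" % (lineno, name))
```

A parser knows that line 2 is a comment and that line 4 only mentions `gets` inside a string; a regexp does not.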
6. Existing automated tools
○ and (too) many times, there are “strange” results
○ Usually a very *bad* idea to just paste the output of
those tools into a (serious) code review report
○ *PLUS* splint fails to see vulnerable calls
7. A smarter approach
➔ C-based projects are ultimately made
to be compiled & linked
◆ Compilers are the best code reviewers !!
● Code is parsed and transformed into another format
● Code is validated
● Some additional checks are even provided by default for
programming errors (type checks, unused vars, invalid
format strings, uninitialized values, etc.)
8. Quick reminder on compilers
● Compiler, noun : a set of programs that transforms source code written in a
programming language into another computer language (Wikipedia).
■ Examples : GCC, as, Python (which embeds a bytecode compiler), etc.
● Abstract representation of compiler behavior:
9. LLVM Specifics
● What makes LLVM so special ?
○ LLVM (Low-Level Virtual Machine) : 13 year old project
○ Many different projects around this architecture
○ LLVM structure *truly* isolates each part
(lexing/optimizing/generating)
○ Totally Plug-and-Play
● you can easily write a lexer for generating Python .pyc files ...
● … or you can use optimizer API to help runtime bug detection (heard of Google
AddressSanitizer module ?) …
● … or you can use an existing parser (for instance GCC’s) and bind it to the rest
of the LLVM architecture (llvm-gcc)
→ really cool features ! Go hack it !!
10. LLVM Specifics
● Clang
○ Default C/C++/Obj-C compiler front end for the LLVM architecture
○ Parser gets .c, .cpp, .m files as input and generates an
Intermediate Representation (IR) of the code
→ this is achieved thanks to an Abstract Syntax Tree (AST)
created when “reading” each source file
○ An API is provided to interact with the generated AST
→ in native C++
→ or higher languages, like Python
■ This means that Clang parses the code for us, so why not use
this to parse code in a smart way (and ultimately find
vulnerabilities) ?
11. Clang Python API
● Relatively easy to use...
○ … but not documented thoroughly enough (only automatically generated documentation)
→ pydoc works fairly well on it
○ Many blog posts on the topic (but sometimes outdated)
○ Namespace fairly intuitive
Basic example : outputs
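As a stand-in for the basic example, here is a minimal sketch that parses a C snippet and lists the functions declared at file scope. It assumes the `clang.cindex` bindings and a matching libclang are installed; the helper names `parse_c_source` and `list_functions` and the `-std=c99` argument are illustrative choices, not part of the original slides.

```python
def parse_c_source(code, filename="example.c"):
    """Parse an in-memory C snippet and return a TranslationUnit."""
    # Imported lazily so that list_functions() below can also be tried on
    # any object tree exposing the same attributes, without libclang.
    import clang.cindex

    index = clang.cindex.Index.create()
    # unsaved_files lets Clang parse a buffer that does not exist on disk.
    return index.parse(filename, args=["-std=c99"],
                       unsaved_files=[(filename, code)])


def list_functions(tu):
    """Return (name, line) for every function declared at file scope."""
    return [(cursor.spelling, cursor.location.line)
            for cursor in tu.cursor.get_children()
            if cursor.kind.name == "FUNCTION_DECL"]
```

Calling `list_functions(parse_c_source(code))` on a small C source string should give one `(name, line)` pair per function found by Clang.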
12. Demo
● clang-draw-ast.py is a 70-line Python script that parses a C source
file and displays the corresponding AST (as a PNG image).
13. (This is the expected result if live demo fails)
15. The magic inside
The indexing engine API is exposed by the `clang.cindex` package.
● Index
○ top-level object which manages some global library state.
● TranslationUnit
○ High-level object encapsulating the AST for a single translation unit
(parsed on the fly)
● SourceRange, SourceLocation, and File
○ Objects representing information about the input source.
16. Clang internals voodoo
The routines in this group provide the ability to create and destroy translation
units from files, either by parsing the contents of the files or by reading in a
serialized representation of a translation unit.
● Once the indexing engine is created, the parse() function
returns a TranslationUnit object
○ The most important object
● A Cursor object is used to iterate through all the nodes
○ kind : the type of the current node
○ displayname : the display name for the entity referenced
○ location : the source location (the starting character)
○ get_children() : returns an iterator over the children of this cursor
○ get_arguments() : returns an iterator over the arguments of this cursor
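The Cursor attributes listed above are enough to walk a whole AST. The sketch below is one possible way to do it; `walk`, `dump` and `call_sites` are hypothetical helper names. They only rely on `kind`, `displayname`, `location`, `get_children()` and `get_arguments()`, so they should work on `tu.cursor` of any parsed TranslationUnit.

```python
def walk(cursor, depth=0):
    """Yield (depth, cursor) for the cursor and all of its descendants."""
    yield depth, cursor
    for child in cursor.get_children():
        yield from walk(child, depth + 1)


def dump(root):
    """Return an indented text view of the AST below root."""
    lines = []
    for depth, node in walk(root):
        lines.append("%s%s %r line %d" % ("  " * depth, node.kind.name,
                                          node.displayname,
                                          node.location.line))
    return "\n".join(lines)


def call_sites(root):
    """Return (callee, line, argument count) for every CALL_EXPR node."""
    return [(node.displayname, node.location.line,
             len(list(node.get_arguments())))
            for _, node in walk(root)
            if node.kind.name == "CALL_EXPR"]
```

With a real translation unit, `print(dump(tu.cursor))` gives a quick textual alternative to the PNG rendering from the demo.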
18. Pros / Cons
Pros
● simple and intuitive Python bindings
● full control over all the code being audited
● parsing and browsing are fast
● can be extended with LLVM extra modules
Cons
● built on Python ctypes : might not work as well for other high-level
languages (Ruby, Java, etc.)
Limitations ?
● Under active development : the API keeps improving and the docs are
becoming more complete
19. Introducing CodeBro!
● Built as a Proof-of-Concept around this idea
○ Meaning : you can use it but don’t rely on it
● Underlying idea : create a web-based tool that acts as an interface between
the AST and the code reviewer
○ The code reviewer can smartly analyse/navigate through the code and
eventually add some modules to detect basic (or advanced)
vulnerabilities
20. CodeBro!
● 100% Open-Source
○ Beer-Ware License
● 100% full Python
● (Hopefully) Easily installable (pip)
● Django-based application (compatible with 1.5+)
○ combines many cool Python based technologies
■ PyDot
■ PyCharm
■ Pygments
■ etc.
○ Keeps things simple
■ 1 project to audit = 1 specific database (default : SQLite)
21. CodeBro!
● Uses the Clang parsing module to dynamically
interact with code
○ Cross-referencing feature similar to IDA Pro
■ only between functions (caller/callee)
○ Call graph generation : visual understanding of code
■ SVG generated graph → can be viewed in a browser
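The caller/callee cross-references and the call graph can be sketched in a few lines. This is not CodeBro's actual code: `build_xrefs` and `to_dot` are hypothetical helpers, and the input is assumed to be a list of `(caller, callee)` name pairs collected while walking the AST.

```python
from collections import defaultdict


def build_xrefs(calls):
    """Build both directions of the caller/callee relation."""
    callers = defaultdict(set)   # callee -> functions that call it
    callees = defaultdict(set)   # caller -> functions it calls
    for caller, callee in calls:
        callers[callee].add(caller)
        callees[caller].add(callee)
    return callers, callees


def to_dot(calls, name="callgraph"):
    """Render the call pairs as Graphviz DOT text, one edge per pair."""
    lines = ["digraph %s {" % name]
    for caller, callee in sorted(set(calls)):
        lines.append('    "%s" -> "%s";' % (caller, callee))
    lines.append("}")
    return "\n".join(lines)


calls = [("main", "parse"), ("main", "log"), ("parse", "log")]
callers, callees = build_xrefs(calls)
print(sorted(callers["log"]))
print(to_dot(calls))
```

The DOT text can then be turned into an SVG with Graphviz, for instance `dot -Tsvg callgraph.dot -o callgraph.svg`.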
22. CodeBro!
● “Analysis” module
○ reports all default diagnostics provided by Clang
○ provides a “Plugin” API
■ some modules implemented
■ … some more to come
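Reporting Clang's default diagnostics could look like the sketch below. It assumes a TranslationUnit obtained from `clang.cindex` and reads its `diagnostics` attribute; `report_diagnostics` is a hypothetical helper, and the severity numbers follow the constants defined in the bindings (0 = ignored up to 4 = fatal).

```python
# Severity values as defined by clang.cindex.Diagnostic.
SEVERITY_NAMES = {0: "ignored", 1: "note", 2: "warning", 3: "error", 4: "fatal"}


def report_diagnostics(tu, minimum=2):
    """Return one formatted line per diagnostic at or above `minimum`."""
    report = []
    for diag in tu.diagnostics:
        if diag.severity < minimum:
            continue
        loc = diag.location
        # Some diagnostics (e.g. command-line issues) have no file attached.
        filename = loc.file.name if loc.file is not None else "<unknown>"
        report.append("%s:%d:%d: %s: %s" % (
            filename, loc.line, loc.column,
            SEVERITY_NAMES.get(diag.severity, "unknown"), diag.spelling))
    return report
```

The default threshold keeps warnings and errors and drops notes; lowering `minimum` to 1 would include them.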
23. CodeBro!
● Extensible through plugins
○ can use AST and/or already existing references
○ Examples :
■ detecting dead code
● find all functions never called (i.e. no down Xref to it)
■ improving format string flaws detection
● “count” number of args for known functions (printf, sprintf,
etc.) and parse the arguments
● detect format string wrapper functions (based on former
calls)
■ detecting use-after-free (to a limited extent)
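Two of the plugin ideas above can be sketched without any Clang-specific code. Everything here is an assumption made for illustration (helper names, the `FORMAT_ARG_INDEX` table, the way arguments are represented) and is not CodeBro's plugin API. Calls are taken to be already extracted from the AST, with string literal arguments given as `str` and anything else as `None`. Positional specifiers such as `%1$d` are not handled.

```python
import re

# One printf-style conversion: flags, width, precision, length, conversion.
SPEC = re.compile(r"%(?:%|[-+ #0]*(\*|\d+)?(?:\.(\*|\d+))?"
                  r"(?:hh|h|ll|l|j|z|t|L)?[diouxXeEfFgGaAcspn])")

# Position of the format string for a few known functions.
FORMAT_ARG_INDEX = {"printf": 0, "fprintf": 1, "sprintf": 1, "snprintf": 2}


def find_uncalled(declared, calls, entry_points=("main",)):
    """Dead code candidates: declared functions that nobody calls."""
    called = {callee for _, callee in calls}
    return sorted(set(declared) - called - set(entry_points))


def expected_args(fmt):
    """Count the arguments consumed by a format string."""
    count = 0
    for match in SPEC.finditer(fmt):
        if match.group(0) == "%%":
            continue  # literal percent sign, consumes nothing
        # A '*' width or precision consumes one extra int argument each.
        count += 1 + (match.group(1) == "*") + (match.group(2) == "*")
    return count


def check_format_call(name, args):
    """Return a warning string for a suspicious call, or None."""
    idx = FORMAT_ARG_INDEX.get(name)
    if idx is None or idx >= len(args):
        return None
    fmt = args[idx]
    if not isinstance(fmt, str):
        return "%s: format string is not a literal" % name
    supplied = len(args) - idx - 1
    needed = expected_args(fmt)
    if supplied != needed:
        return "%s: format expects %d argument(s), %d supplied" % (
            name, needed, supplied)
    return None
```

A non-literal format string is reported separately because that is the classic format string flaw, whereas an argument count mismatch is more often a plain bug.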
28. Future enhancements
● Still a work in progress
● Fix bugs
● Index all components of source files (instead of just CALL_EXPR and
FUNCTION_DECL)
● Improve search engine
● Add macro parsing
● Integrate more source code input vectors (Git - as soon as there is a decent
Python Git bindings package)
● Improve C++ and Objective-C analysis
● Add moar modulez !!