2. Background
• Client was being sued for Copyright Infringement
• Client’s lawyer wanted two questions answered
• Does the code contain any open source or GPL code?
• When was the code in question written?
• Code was written in PHP (web-based application)
• Code had absolutely no comments
• No copyright headers
• No dates of any kind
www.504ensics.com
3. Goal
• If it can be proven that the code contains open
source or GPL code with restrictive licenses, then
the claim is invalid
• If it can be proven that the copyrighted code on file
was written after the author’s claimed “creation
date”, the copyright is invalid
4. Is code original?
• No comments or headers that would imply
authorship
• Code didn’t look familiar
• Code was kind of crappy
5. Step 1 – Acquire Samples
• Wrote Python script to download all projects
written in PHP from Github
• Scraped from search feature
• Limited to 50 pages of search
• Got roughly 10 GB of compressed code
• ~100,000 files
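The paging loop behind a scrape like this can be sketched as follows. This is only an illustration: `collect_repo_urls` and the `fetch_page` callable are hypothetical names, and the actual request and HTML parsing are left out because the talk does not describe them.

```python
from typing import Callable, Iterable, List

MAX_PAGES = 50  # the search cap mentioned above


def collect_repo_urls(fetch_page: Callable[[int], Iterable[str]],
                      max_pages: int = MAX_PAGES) -> List[str]:
    """Walk search result pages 1..max_pages and gather unique repo URLs.

    fetch_page is a hypothetical callable: given a page number, it returns
    the repository URLs found on that page (empty when results run out).
    """
    seen = set()
    ordered = []
    for page in range(1, max_pages + 1):
        urls = list(fetch_page(page))
        if not urls:
            break  # no more results
        for url in urls:
            if url not in seen:  # the same project can appear on two pages
                seen.add(url)
                ordered.append(url)
    return ordered


if __name__ == "__main__":
    # Stand-in fetcher with made-up data; a real one would request and
    # parse a search results page.
    fake_results = {
        1: ["https://example.com/a/one", "https://example.com/b/two"],
        2: ["https://example.com/b/two", "https://example.com/c/three"],
    }
    print(collect_repo_urls(lambda page: fake_results.get(page, [])))
```

Keeping the fetcher separate from the loop means the page cap and de-duplication can be checked without touching the network.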
7. Fuzzy Hashing
• Vassil says I have to call it “Approximate Matching”
• Ssdeep
• Vassil Roussev & Candace Quates
• Free, Open Source
• Awesome
• Traditional hashing
• If a single bit of the input changes, the whole hash
changes
• Fuzzy Hashing
• Compares files and gives a similarity index
• Can find “similar” files
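The contrast between the two kinds of hashing can be illustrated with the standard library alone. Note the swap: `similarity_percent` below uses `difflib.SequenceMatcher` as a stand-in for a similarity score. It is not the algorithm used by a real fuzzy hashing tool, and both function names are made up for this sketch.

```python
import hashlib
from difflib import SequenceMatcher


def exact_digest(text: str) -> str:
    """Traditional hash: any change to the input yields a different digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def similarity_percent(a: str, b: str) -> int:
    """Rough 0-100 similarity score (difflib stand-in, not a fuzzy hash)."""
    return round(SequenceMatcher(None, a, b, autojunk=False).ratio() * 100)


if __name__ == "__main__":
    original = "function plus_one($x) { return $x + 1; }"
    renamed = "function plus_one($y) { return $y + 1; }"

    # The digests share nothing useful, although only a variable name changed.
    print(exact_digest(original))
    print(exact_digest(renamed))

    # The similarity score still reports the two snippets as near-identical.
    print(similarity_percent(original, renamed))
```

A real approximate-matching tool also compares compact signatures rather than whole files, which is what makes it practical across roughly 100,000 samples.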
8. When was code written?
• We can invalidate copyright if the sample on file
was written after the claimed authorship date
• No comments or dates of any kind in the code!
• No access to developer’s workstation to do
traditional forensics
• ???
9. PHP
• Web-based language
• Updated reasonably frequently
• New Features added often
• Goal
• Determine which features were used in the code
• Correlate features with PHP release date
• Code couldn’t have been written before this date
10. Step 1 – Function Use
• Programmer can create own functions or use ones
available in the language
• Example:
• function plus_one($x) { return $x + 1; }
• Python script to find all function declarations and
calls
• Ignore declared functions
• Left with a list of language “features” used
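A minimal sketch of that extraction step, assuming regular expressions are good enough for code that has no comments. The names `DECL_RE`, `CALL_RE`, `PHP_CONSTRUCTS` and `builtin_candidates` are illustrative and not taken from the script used in the case.

```python
import re

# "function name(" declarations, including by-reference "function &name("
DECL_RE = re.compile(r"\bfunction\s+&?\s*([A-Za-z_]\w*)\s*\(")

# "name(" calls; the lookbehind skips $var(...), ->method(...) and ::method(...)
CALL_RE = re.compile(r"(?<![\w$>:])([A-Za-z_]\w*)\s*\(")

# Language constructs that can be followed by "(" but are not functions
PHP_CONSTRUCTS = {
    "if", "elseif", "while", "for", "foreach", "switch", "catch", "declare",
    "function", "array", "list", "isset", "empty", "unset", "exit", "die",
    "echo", "print", "return", "include", "include_once", "require",
    "require_once",
}


def builtin_candidates(source: str) -> set:
    """Return called names that the code does not declare itself."""
    # PHP function names are case-insensitive, so compare in lower case.
    declared = {name.lower() for name in DECL_RE.findall(source)}
    called = {name.lower() for name in CALL_RE.findall(source)}
    return called - declared - PHP_CONSTRUCTS


if __name__ == "__main__":
    sample = """<?php
function plus_one($x) { return $x + 1; }
$n = plus_one(strlen($name));
if (isset($n)) { echo json_encode(array($n)); }
"""
    print(sorted(builtin_candidates(sample)))
```

The sketch does not skip string literals and treats `new Foo(` as a call, so a result should be checked against the documentation before it is relied on.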
11. Step 2 – Version Detection
• PHP comes with auto-generated documentation
about each built-in function
• Documentation says in which version each function
first became available
• Wrote Python script to scrape PHP documentation
• Correlate functions with PHP versions
• We only care about the function with the newest
version
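Once each function has a "first available in" version, picking the newest one is a small comparison problem. The sketch below assumes the scraped data is already in a dictionary. `version_key`, `newest_requirement` and the sample table are illustrative, and the sample versions should be checked against the PHP manual rather than taken from here.

```python
from typing import Dict, Iterable, Optional, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn "5.1.5" into (5, 1, 5) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def newest_requirement(functions: Iterable[str],
                       introduced_in: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (function, version) for the most recently introduced function.

    Functions missing from the table are ignored; None means nothing matched.
    """
    known = {name: introduced_in[name]
             for name in functions if name in introduced_in}
    if not known:
        return None
    name = max(known, key=lambda n: version_key(known[n]))
    return name, known[name]


if __name__ == "__main__":
    # Illustrative table only; a real run would fill this from the scraped docs.
    sample_table = {
        "strlen": "4.0.0",
        "file_put_contents": "5.0.0",
        "json_encode": "5.2.0",
    }
    used = {"strlen", "json_encode", "my_helper"}
    print(newest_requirement(used, sample_table))
```

Comparing tuples of integers rather than raw strings matters because a string comparison would rank "5.10.0" below "5.9.0".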
12. Step 3 – Date the code
• PHP has an archive of release notes on its
website
• Contains release versions and dates
• Python script scrapes release notes for the PHP
version of interest and gives us the release date
• Reasonably, the code couldn’t have been written
before that date
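The final comparison can be sketched as below, assuming the release notes have already been reduced to a version-to-date table. The function names and the claimed creation date are hypothetical. The one table entry is the PHP 5.1.5 release date quoted in this talk.

```python
from datetime import date, datetime
from typing import Dict


def parse_release_date(text: str) -> date:
    """Parse dates written like "17-Aug-2006" (English month abbreviations)."""
    return datetime.strptime(text, "%d-%b-%Y").date()


def earliest_possible_date(version: str, releases: Dict[str, str]) -> date:
    """Code using a feature of `version` cannot predate that release."""
    return parse_release_date(releases[version])


def claim_is_contradicted(claimed_creation: date, version: str,
                          releases: Dict[str, str]) -> bool:
    """True when the claimed creation date is before the required release."""
    return claimed_creation < earliest_possible_date(version, releases)


if __name__ == "__main__":
    releases = {"5.1.5": "17-Aug-2006"}  # date as quoted in this talk
    hypothetical_claim = date(2006, 1, 1)  # made up for illustration
    print(earliest_possible_date("5.1.5", releases))
    print(claim_is_contradicted(hypothetical_claim, "5.1.5", releases))
```

The check only gives a lower bound: code that uses nothing newer than a given release could still have been written long after it.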
13. Step 4 – Profit
• Win!
• Code in question used features first available in
PHP 5.1.5
• Release date 17-Aug-2006
• This was after the claimed creation date
14. Conclusion
• Sometimes you can’t depend solely on existing
tools
• Learn to program even if you’re not a
“programmer”
• PHP sucks
• Fuzzy Hashing and Python are Cool