This presentation demonstrates how NoSQL technologies were used to solve a difficult analytical problem that traditional SQL databases could not. PDF malware has been on the rise for the past few years and has become one of the most successful methods for attackers to gain unauthorized access to a network. The standard approach to analyzing malicious PDF documents has been largely solitary: commercial entities do not share their data, so researchers must fend for themselves, and more often than not a PDF file is analyzed in isolation from other malicious PDF files. I found this approach highly inefficient, but storing multiple PDF documents in a database was a problem in itself. Traditional SQL databases didn’t seem like the right fit given their rigid schemas and strictly relational models.
PDF files also contain a lot of dynamic data, which makes them a tough fit for a traditional SQL model. PDF A could contain 40 objects whereas PDF B could contain 3,000. Scaling this out becomes difficult and messy. In approaching this problem, I wanted to solve several issues at once: I needed a good way to share PDF samples, an easy way to query a corpus of documents, and the ability to efficiently get my data back out so I could display it elsewhere.
With all my samples in a JSON format, MongoDB just made sense: it could take in these objects and let me query them as a whole or independently. MongoDB also provided a rich tool set for answering questions that had never been posed before. Using single- and multi-step map/reduce jobs, I was able to aggregate PDF characteristics and apply simple averaging to identify commonalities shared between malicious documents.
Though I have had great outcomes and successes with MongoDB, there have also been annoyances and unexplained details. These issues had me pinned against a wall for days and sometimes left me wondering whether I had picked the wrong model for tackling this problem. At the end of the day, though, I was able to overcome these issues and account for them without much hassle.
This talk will cover new research methods and tools, built on NoSQL technologies, for analyzing PDF documents more efficiently and in a way that promotes collaboration within the community. It will also serve as a step forward in detecting malicious PDFs by looking at them from a statistical standpoint. Looking back on the choice to use MongoDB, I think I made a great decision. I can’t imagine handling thousands of uniquely named function calls, each with multiple attributes per PDF, or running multi-step map/reduce jobs against a blob in a SQL database rather than a Mongo document. MongoDB provided rich functionality that carried this project to success.
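The "simple averaging" mentioned above can be sketched in plain Python. The field name `characteristics` and the sample values below are hypothetical, not PDF X-RAY's actual schema; this is only an illustration of the idea.

```python
from collections import Counter

def feature_averages(docs):
    """Average the per-document count of each PDF characteristic
    (e.g. named function occurrences) across a set of samples."""
    totals = Counter()
    for doc in docs:
        totals.update(doc.get("characteristics", {}))
    n = len(docs)
    return {name: count / n for name, count in totals.items()}

# Hypothetical malicious samples as they might come back from Mongo.
malicious = [
    {"characteristics": {"/JavaScript": 2, "/OpenAction": 1}},
    {"characteristics": {"/JavaScript": 4}},
]
print(feature_averages(malicious))  # {'/JavaScript': 3.0, '/OpenAction': 0.5}
```

Comparing these averages between a malicious corpus and a clean one is what surfaces the shared commonalities.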
2. Who I Am
¤ Security Researcher
¤ GWU CERT 9b+
¤ Past
¤ Security Consultant @ G2, Inc.
¤ SMT/Network Engineer @ Windermere
¤ Focus
¤ PDF Malware Analysis
¤ Messaging Technologies
3. Agenda
¤ PDF Woes
¤ Overview of PDF X-RAY
¤ Why Mongo?
¤ Challenges
¤ Old Questions with New Answers
¤ Conclusions
4. The PDF Problem
• Extremely diverse and flexible
• Heavily used by attackers
• Difficult to parse and identify malicious content
• Widely distributed
http://www.zdnet.com/blog/security/study-6-out-of-every-10-users-run-vulnerable-adobe-reader/9014
5. Oh, but there’s more!
PDF A (200KB)
¤ 11 Objects
¤ 291 Names/Dicts
¤ Metadata
¤ Filters for compression
¤ Embedded documents
¤ Multiple updates
PDF B (3MB)
¤ 300 Objects
¤ 158 Names/Dicts
¤ Partial metadata
¤ No filters applied
¤ References to outside sites
¤ Single update
6. PDF X-RAY
Process
1. Upload PDF
2. Convert to JSON
3. Store in Mongo
4. Create report
5. User is informed
Stats
¤ 10 collections
¤ ~80K documents
¤ ~3GB of data
¤ PyMongo
¤ Mongo
¤ Django
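The convert-and-store steps of the process can be sketched in Python. The document shape, helper name, and collection names here are assumptions for illustration, not PDF X-RAY's actual schema.

```python
import hashlib

def pdf_to_document(raw_bytes, objects):
    """Convert one parsed PDF into a JSON-able document ready for Mongo.
    `objects` is assumed to be the list of object entries produced by
    the parser (a hypothetical shape)."""
    return {
        "md5": hashlib.md5(raw_bytes).hexdigest(),
        "size": len(raw_bytes),
        "object_count": len(objects),
        # Variable-length payload: 11 objects or 3,000, Mongo stores either.
        "objects": objects,
    }

doc = pdf_to_document(b"%PDF-1.4 sample", [{"num": 1, "names": ["/JavaScript"]}])

# Storing it with PyMongo would then be a one-liner (collection name assumed):
#   from pymongo import MongoClient
#   MongoClient().pdfxray.samples.insert_one(doc)
```

Because the document is ordinary JSON-compatible data, the same dict feeds both the Mongo insert and the report the user sees.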
8. Why Mongo?
¤ JSON in, JSON out
¤ Documents treated independently
¤ No fixed layout or schema
¤ MapReduce power
¤ Heard it was web-scale
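The first three bullets can be shown concretely. The field names below are hypothetical, and the Mongo query is emulated in plain Python for illustration.

```python
import json

# Two samples with very different shapes; in Mongo both can live in one
# collection with no schema migration.
pdf_a = {"md5": "a" * 32, "objects": 11, "metadata": {"Producer": "Acrobat"}}
pdf_b = {"md5": "b" * 32, "objects": 3000, "filters": ["/FlateDecode"]}

# "JSON in, JSON out": both round-trip cleanly through JSON,
# which is how samples enter and leave the system.
for doc in (pdf_a, pdf_b):
    assert json.loads(json.dumps(doc)) == doc

# A query like db.samples.find({"objects": {"$gt": 100}}) treats each
# document independently and skips any that lack the field; emulated here:
big = [d for d in (pdf_a, pdf_b) if d.get("objects", 0) > 100]
print([d["md5"][0] for d in big])  # ['b']
```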
18. MapReduce Fun
¤ How many unique named functions occur across all malicious documents compared to good ones?
¤ Despite not being required, is metadata ever useful as an indicator when classifying a PDF?
¤ What can we glean about Anti-Virus software based on stored reports (assuming they are available)?
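The first question could be posed as a Mongo map/reduce job along these lines. The field name `named_functions` is an assumption, not PDF X-RAY's actual schema, and the same logic is emulated in plain Python below for illustration.

```python
# JavaScript map/reduce functions as they would be passed to Mongo
# (e.g. via PyMongo's collection.map_reduce); field names are hypothetical.
MAP = """function () {
    this.named_functions.forEach(function (name) {
        emit(name, 1);
    });
}"""
REDUCE = """function (key, values) {
    return Array.sum(values);
}"""

# The same counting logic emulated in plain Python:
def count_named_functions(docs):
    counts = {}
    for doc in docs:
        for name in doc["named_functions"]:
            counts[name] = counts.get(name, 0) + 1
    return counts

malicious = [{"named_functions": ["/JS", "/OpenAction"]},
             {"named_functions": ["/JS"]}]
print(count_named_functions(malicious))  # {'/JS': 2, '/OpenAction': 1}
```

Running the same job against the benign collection and diffing the two result sets is what makes the malicious-versus-good comparison possible.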
24. Gripes
¤ Size limits
¤ MapReduce troubleshooting
¤ Restoring collections into each other (no upsert)
¤ Getting the entire object back on specific queries
¤ Lack of triggers
25. Kudos, 10gen
¤ Size limits keep getting bumped up
¤ Output shaping is in testing
¤ Simple aggregation through queries is in testing
¤ Google group responses are quick