This presentation demonstrates how NoSQL technologies were used to solve a difficult analytical problem that traditional SQL databases could not. PDF malware has been on the rise for the past few years and has become one of the most successful methods for attackers to gain unauthorized access to a network. The standard approach to analyzing malicious PDF documents has been largely solitary: commercial entities do not share their data, so researchers must fend for themselves, and more often than not a PDF file is analyzed in isolation from other malicious PDF files. I found this approach highly inefficient, but storing multiple PDF documents in a database was a problem in itself. Traditional SQL databases didn’t seem like the right fit given their rigid schemas and strictly relational models.
PDF files also contain a lot of dynamic data, which makes them a tough fit for a traditional SQL model. PDF A could contain 40 objects whereas PDF B could contain 3,000. Scaling this out becomes difficult and messy. In approaching this problem, I wanted to solve several issues at once: I needed a good way to share PDF samples, an easy way to query a corpus of documents, and the ability to efficiently get my data back out so I could display it elsewhere.
With all my samples in a JSON format, MongoDB just made sense: it could take in these objects and let me query them as a whole or independently. MongoDB also provided a rich tool set for answering questions that had never been posed before. Using single- and multi-step map/reduce jobs, I was able to aggregate PDF characteristics and apply simple averaging to identify commonalities shared between malicious documents.
Though I have had great outcomes and successes with MongoDB, there have also been annoyances and unexplained details. These issues had me pinned against a wall for days and sometimes left me wondering whether I had picked the wrong model for tackling this problem. At the end of the day, though, I was able to overcome these issues and account for them without much hassle.
This talk will cover new research methods and tools, built on NoSQL technologies, for analyzing PDF documents more efficiently and in a way that promotes collaboration within the community. It will also serve as a step forward in detecting malicious PDFs by looking at them from a statistical standpoint. Looking back on the choice to use MongoDB, I think I made a great decision. I can’t imagine handling thousands of uniquely named function calls, each with multiple attributes per PDF, or running multi-step map/reduce jobs against a blob in a SQL database rather than a Mongo document. MongoDB provided rich functionality that carried this project to success.
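The "simple averaging" mentioned above can be sketched in plain Python. The field name `characteristics` and the sample values below are hypothetical, not PDF X-RAY's actual schema; this is only an illustration of the idea.

```python
from collections import Counter

def feature_averages(docs):
    """Average the per-document count of each PDF characteristic
    (e.g. named function occurrences) across a set of samples."""
    totals = Counter()
    for doc in docs:
        totals.update(doc.get("characteristics", {}))
    n = len(docs)
    return {name: count / n for name, count in totals.items()}

# Hypothetical malicious samples as they might come back from Mongo.
malicious = [
    {"characteristics": {"/JavaScript": 2, "/OpenAction": 1}},
    {"characteristics": {"/JavaScript": 4}},
]
print(feature_averages(malicious))  # {'/JavaScript': 3.0, '/OpenAction': 0.5}
```

Comparing these averages between a malicious corpus and a clean one is what surfaces the shared commonalities.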
2. Who I Am
¤ Security Researcher
¤ GWU CERT 9b+
¤ Past
¤ Security Consultant @ G2, Inc.
¤ SMT/Network Engineer @ Windermere
¤ Focus
¤ PDF Malware Analysis
¤ Messaging Technologies
3. Agenda
¤ PDF Woes
¤ Overview of PDF X-RAY
¤ Why Mongo?
¤ Challenges
¤ Old Questions with New Answers
¤ Conclusions
4. The PDF Problem
• Extremely diverse and flexible
• Heavily used by attackers
• Difficult to parse and identify malicious content
• Widely distributed
http://www.zdnet.com/blog/security/study-6-out-of-every-10-users-run-vulnerable-adobe-reader/9014
5. Oh, but there’s more!
PDF A (200KB)
¤ 11 Objects
¤ 291 Names/Dicts
¤ Metadata
¤ Filters for compression
¤ Embedded documents
¤ Multiple updates
PDF B (3MB)
¤ 300 Objects
¤ 158 Names/Dicts
¤ Partial metadata
¤ No filters applied
¤ References to outside sites
¤ Single update
6. PDF X-RAY
Process
1. Upload PDF
2. Convert to JSON
3. Store in Mongo
4. Create report
5. User is informed
Stats
¤ 10 collections
¤ ~80K documents
¤ ~3GB of data
¤ PyMongo
¤ Mongo
¤ Django
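The convert-and-store steps of the process can be sketched in Python. The document shape, helper name, and collection names here are assumptions for illustration, not PDF X-RAY's actual schema.

```python
import hashlib

def pdf_to_document(raw_bytes, objects):
    """Convert one parsed PDF into a JSON-able document ready for Mongo.
    `objects` is assumed to be the list of object entries produced by
    the parser (a hypothetical shape)."""
    return {
        "md5": hashlib.md5(raw_bytes).hexdigest(),
        "size": len(raw_bytes),
        "object_count": len(objects),
        # Variable-length payload: 11 objects or 3,000, Mongo stores either.
        "objects": objects,
    }

doc = pdf_to_document(b"%PDF-1.4 sample", [{"num": 1, "names": ["/JavaScript"]}])

# Storing it with PyMongo would then be a one-liner (collection name assumed):
#   from pymongo import MongoClient
#   MongoClient().pdfxray.samples.insert_one(doc)
```

Because the document is ordinary JSON-compatible data, the same dict feeds both the Mongo insert and the report the user sees.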
8. Why Mongo?
¤ JSON in, JSON out
¤ Documents treated independently
¤ No fixed layout or schema
¤ MapReduce power
¤ Heard it was web-scale
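The first three bullets can be shown concretely. The field names below are hypothetical, and the Mongo query is emulated in plain Python for illustration.

```python
import json

# Two samples with very different shapes; in Mongo both can live in one
# collection with no schema migration.
pdf_a = {"md5": "a" * 32, "objects": 11, "metadata": {"Producer": "Acrobat"}}
pdf_b = {"md5": "b" * 32, "objects": 3000, "filters": ["/FlateDecode"]}

# "JSON in, JSON out": both round-trip cleanly through JSON,
# which is how samples enter and leave the system.
for doc in (pdf_a, pdf_b):
    assert json.loads(json.dumps(doc)) == doc

# A query like db.samples.find({"objects": {"$gt": 100}}) treats each
# document independently and skips any that lack the field; emulated here:
big = [d for d in (pdf_a, pdf_b) if d.get("objects", 0) > 100]
print([d["md5"][0] for d in big])  # ['b']
```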
18. MapReduce Fun
¤ How many unique named functions occur across all malicious documents compared to good ones?
¤ Despite not being required, is metadata ever useful as an indicator when classifying a PDF?
¤ What can we glean about Anti-Virus software based on stored reports (assuming they are available)?
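The first question could be posed as a Mongo map/reduce job along these lines. The field name `named_functions` is an assumption, not PDF X-RAY's actual schema, and the same logic is emulated in plain Python below for illustration.

```python
# JavaScript map/reduce functions as they would be passed to Mongo
# (e.g. via PyMongo's collection.map_reduce); field names are hypothetical.
MAP = """function () {
    this.named_functions.forEach(function (name) {
        emit(name, 1);
    });
}"""
REDUCE = """function (key, values) {
    return Array.sum(values);
}"""

# The same counting logic emulated in plain Python:
def count_named_functions(docs):
    counts = {}
    for doc in docs:
        for name in doc["named_functions"]:
            counts[name] = counts.get(name, 0) + 1
    return counts

malicious = [{"named_functions": ["/JS", "/OpenAction"]},
             {"named_functions": ["/JS"]}]
print(count_named_functions(malicious))  # {'/JS': 2, '/OpenAction': 1}
```

Running the same job against the benign collection and diffing the two result sets is what makes the malicious-versus-good comparison possible.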
24. Gripes
¤ Size limits
¤ MapReduce troubleshooting
¤ Restoring collections into each other (no upsert)
¤ Getting the entire object back on specific queries
¤ Lack of triggers
25. Kudos, 10gen
¤ Size limits keep getting bumped up
¤ Output shaping is in testing
¤ Simple aggregation through queries is in testing
¤ Google group responses are quick