Monkey-Spider is a crawler-based low-interaction honeyclient that analyzes Web sites for malicious content. It builds on existing Free Software such as Heritrix for crawling and ClamAV for malware scanning. The tool was created to help build a database of Internet threats through fast, broad, automated analysis of millions of Web sites. Preliminary crawls in March and April 2007 found that 1% of the examined sites were malicious, with most malware spread through pirate and wallpaper sites. The slides close with future trends: attacks increasingly target client programs other than the Web browser, and malware increasingly carries techniques to detect honeypots, virtual machines and anti-virus programs.
1. Monkey-Spider
Detecting Malicious Websites with Low-Interaction Honeyclients
Ali Ikinci Thorsten Holz Felix Freiling
ali.ikinci(at)contentkeeper.com
{holz|freiling}(at)informatik.uni-mannheim.de
presented at
2. Outline
➔ Problem and related work
➔ Challenge and requirements analysis
➔ Honeypots and honeyclients
➔ Monkey-Spider and its limitations
➔ Preliminary results
➔ Key Findings
➔ Future Trends
Monkey-Spider 2
3. Malicious Web sites ...
● Are Web sites that could threaten the security of the client computers requesting them
● Even a visit to such a site without any further interaction can be a threat (so-called drive-by downloads)
● Such Web sites can ...
– host all sorts of malware and malicious code
– exploit browser vulnerabilities
– exploit vulnerabilities of other client software
– install backdoors, spyware or keyloggers
– steal confidential information
4. The Problem continued
● There is no comprehensive, up-to-date and free database of threats on the Internet
● Any Web site, even a trusted one, could serve malicious content
● Manual malware analysis of malicious Web sites is too slow and too expensive
● Even automatic analysis is often too slow to cover millions of Web sites
5. Related Work
a) Dedicated honeyclient
b) Browsing tool to use normal Web users' PCs as honeyclients
c) Offline honeyclient-like code analyzer
d) Browsing tool to control access to malicious Web sites
e) Database
Tool                        Marks under a to e (column positions not preserved)
Caffeine Monkey             X
Capture-HPC                 X
Explabs LinkScanner         X X
Finjan SecureBrowsing       X X
Firekeeper                  X
HoneyC                      X
Malzilla                    X
McAfee SiteAdvisor          X X X
Microsoft HoneyMonkey       X X
MITRE Honeyclient           X
Monkey-Spider               X
Phoneyc                     X
SHELIA                      X
SpyBye                      X
TrendMicro TrendProtect     ? X X
UW Spycrawler               X X
Web exploit finder          X
6. Challenge
● Fast and broad-scope analysis of millions of resources on the Internet
● Find actual threats and zero-day exploits on the Internet
● Collect malicious code
● Allow various infection vectors
● Build a database with detailed relevant information about threats
● Continuous monitoring of suspicious resources
8. Requirements Analysis
• Overall Requirements
– Performance!
– Modularity and multi-threaded modules
– Expandability
– Scalability
– Logging and statistics
• Crawler
– Crawling policies
– Link extraction
– URL normalization
– Efficient storage
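The URL normalization requirement can be illustrated with a minimal sketch. This is not Monkey-Spider's or Heritrix's actual canonicalization logic; the function name `normalize_url` and the specific rules (lowercase scheme and host, drop default ports, drop fragments, use "/" for an empty path) are assumptions chosen for illustration. It also ignores userinfo and IPv6 literals for brevity.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the resource.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a canonical form of url so duplicates can be detected.

    Hypothetical helper: the rules here are illustrative assumptions,
    not the crawler's real policy.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it differs from the scheme's default.
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    # An empty path and "/" name the same resource.
    path = parts.path or "/"
    # Fragments are never sent to the server, so they are dropped.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


if __name__ == "__main__":
    queue = ["HTTP://Example.COM:80/a/b?x=1#top", "http://example.com/a/b?x=1"]
    # Both entries collapse to a single normalized URL.
    print({normalize_url(u) for u in queue})
```

Normalizing before a URL enters the crawl queue lets a simple set membership check avoid fetching the same resource twice.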
9. Requirements Analysis
• Malware scanner
– Multiple malware scanners
– Support for automated malware analysis tools
– Client-side scripting support
• JavaScript, VBScript, ActionScript ...
– Client software support
• Media Players, Office Applications, Acrobat Reader ...
10. Solution ideas
● Do not reinvent the wheel
● Use existing Free Software
● Use existing honeypot techniques
● Use extensive prototyping
● Only superficial detection
11. Honeypots
● Honeypots are dedicated deception devices
● Two types:
– server honeypots (or simply honeypots)
– client honeypots (or honeyclients)
● Both can be classified as:
– low-interaction honeypots or
– high-interaction honeypots
12. Our Solution - The Monkey-Spider
● A crawler-based low-interaction honeyclient
● Started as a diploma thesis in 2006
● Available under the GPL at http://monkeyspider.sourceforge.net
● Written in Python
● Makes use of Heritrix, PostgreSQL, ClamAV and Web Services
● Command-line tool set for the analysis of crawled content
14. Monkey-Spider - Queue Generation
● Provide starting point(s) (seeds) using different approaches:
– Web search seeders (MSN and Yahoo)
– (Spam) mail seeder
– Hosts file seeder
● Future seeders might include
– Monitoring seeder
– Typo-squatting seeder
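As an illustration of the hosts file seeder idea, the following is a minimal sketch that turns a hosts file into crawl seeds. It assumes a blocklist-style hosts file in the usual `IP hostname [hostname ...]` format; the function name `seeds_from_hosts`, the `http://` prefix and the list of skipped local names are assumptions, not the tool's actual implementation.

```python
# Names that refer to the local machine and make no sense as crawl seeds.
LOCAL_NAMES = {"localhost", "localhost.localdomain", "broadcasthost"}


def seeds_from_hosts(text: str) -> list[str]:
    """Extract unique seed URLs from hosts-file text (hypothetical helper)."""
    seeds = []
    seen = set()
    for line in text.splitlines():
        # Strip comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        # fields[0] is the IP address; the remaining fields are host names.
        for host in fields[1:]:
            host = host.lower()
            if host in LOCAL_NAMES or host in seen:
                continue
            seen.add(host)
            seeds.append(f"http://{host}/")
    return seeds


if __name__ == "__main__":
    sample = """# sample blocklist
127.0.0.1 localhost
127.0.0.1 bad.example.com   # reported host
0.0.0.0   Ads.Example.net bad.example.com
"""
    for seed in seeds_from_hosts(sample):
        print(seed)
```

The resulting list can be handed to the crawler as its seed file, one URL per line.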
15. Monkey-Spider - Malware Scanner
● ARC files are unpacked and examined
● Malware scanners are executed on the crawled content
– Detected malware is stored for optional further research
– Information about the malware is stored in the database
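The scan-and-store step can be sketched roughly as follows. This is an assumption-laden illustration rather than the tool's code: it presumes the ARC contents have already been unpacked into a directory, runs the ClamAV command-line scanner `clamscan` over it, parses the `path: Signature FOUND` lines, and records them. The function names, the table layout and the use of `sqlite3` as a stand-in for the PostgreSQL database named in the slides are all hypothetical.

```python
import sqlite3
import subprocess


def run_clamscan(directory: str) -> str:
    """Scan a directory recursively and return clamscan's text output.

    clamscan exits with status 1 when malware is found, so the return
    code is deliberately not treated as an error here.
    """
    result = subprocess.run(
        ["clamscan", "-r", "--no-summary", directory],
        capture_output=True,
        text=True,
    )
    return result.stdout


def parse_clamscan_output(output: str) -> list[tuple[str, str]]:
    """Return (path, signature) pairs for every 'FOUND' line."""
    findings = []
    for line in output.splitlines():
        line = line.strip()
        if not line.endswith(" FOUND"):
            continue  # skip "OK" lines and anything else
        path, _, rest = line.rpartition(": ")
        if not path:
            continue
        findings.append((path, rest[: -len(" FOUND")]))
    return findings


def store_findings(conn: sqlite3.Connection, findings) -> None:
    """Store findings in a hypothetical 'malware' table (sqlite3 stand-in)."""
    conn.execute("CREATE TABLE IF NOT EXISTS malware (path TEXT, signature TEXT)")
    conn.executemany("INSERT INTO malware (path, signature) VALUES (?, ?)", findings)
    conn.commit()


if __name__ == "__main__":
    # Sample output instead of a live scan, so the sketch runs without ClamAV.
    sample = "/tmp/arc/a.exe: Eicar-Test-Signature FOUND\n/tmp/arc/b.html: OK\n"
    conn = sqlite3.connect(":memory:")
    store_findings(conn, parse_clamscan_output(sample))
    print(conn.execute("SELECT path, signature FROM malware").fetchall())
```

Keeping the parser separate from the scanner call makes it straightforward to add further malware scanners, each with its own output parser feeding the same table.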
Sample of extracted file names
16. Limitations for now
● Analysis is limited to the publicly indexable Web
● Only known malware is recognized and stored
● Drive-by download sites and heavily obfuscated JavaScript are not detected
● Zero-day exploits are not recognized
● A full scan of the Web is not possible with Heritrix (yet?)
● Separate crawl jobs are not yet aware when they examine the same sites and content
17. Preliminary Results
● We ran various crawls over two months, in March and April 2007
● We crawled for various topics and also ran a hosts-file-based crawl
● Defective crawl settings left the preliminary results incomplete
MIME-type distribution of crawled content:
19. Performance
● Measurements taken on a standard PC
● Measured as throughput rather than per Web site
● Crawl performance of 1 MB/sec
● Malware analysis (without the crawling) takes 0.05 seconds per downloaded content item and 2.35 seconds per downloaded and compressed MB
● This results in about 3.35 seconds per analyzed MB of content (1 second of crawling plus 2.35 seconds of analysis)
● In comparison:
– other low-interaction honeyclients require a minimum of 3 seconds per Web site
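The 3.35 seconds per MB figure follows from adding the crawl time per MB (the inverse of the 1 MB/sec crawl rate) to the 2.35 seconds of analysis per MB. The short sketch below reproduces that arithmetic and extrapolates it to larger crawls. It assumes crawling and analysis run one after the other, as the sum on the slide implies; the function names are hypothetical.

```python
# Figures taken from the slide.
CRAWL_MB_PER_SEC = 1.0   # crawl throughput
SCAN_SEC_PER_MB = 2.35   # malware analysis cost per compressed MB


def seconds_per_mb(crawl_mb_per_sec: float = CRAWL_MB_PER_SEC,
                   scan_sec_per_mb: float = SCAN_SEC_PER_MB) -> float:
    """Total time per MB when crawling and scanning run sequentially."""
    return 1.0 / crawl_mb_per_sec + scan_sec_per_mb


def hours_for(total_mb: float) -> float:
    """Estimated wall-clock hours to crawl and analyze total_mb of content."""
    return total_mb * seconds_per_mb() / 3600.0


if __name__ == "__main__":
    print(f"{seconds_per_mb():.2f} s per MB")
    print(f"{hours_for(1024):.2f} h per GB")
```

If crawling and analysis were pipelined on separate machines, the slower stage (analysis at 2.35 seconds per MB) would bound the throughput instead of the sum.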
20. Key Findings
● 1% of all examined Web sites are malicious
● Adult Web sites are relatively harmless
● Most malware is spread through pirate and wallpaper distribution Web sites
● A Web site has to be completely crawled and analyzed to gather representative results
● The scope of the crawl has to be chosen carefully
● We know very little about malicious Web sites and their operators
21. Future Trends
● Attacks are shifting more and more from the server to the client
● Client programs other than the Web client, such as media players and Flash and PDF interpreters, are targeted more often
● Advanced techniques built into malware for detecting honeypots, virtual machines and anti-virus programs complicate its detection
● Web exploitation kits, which provide an infrastructure for Web-based attacks, are on the rise