Chaz Lever, Georgia Institute of Technology
Both the operational and academic security communities have used dynamic analysis sandboxes to execute malware samples for roughly a decade. Network information derived from dynamic analysis is frequently used for threat detection, network policy, and incident response. Despite these common and important use cases, the efficacy of the network detection signal derived from such analysis has yet to be studied in depth. This paper seeks to address this gap by analyzing the network communications of 26.8 million samples that were collected over a period of five years.
Using several malware and network datasets, our large-scale study makes three core contributions. (1) We show that dynamic analysis traces should be carefully curated and provide a rigorous methodology that analysts can use to remove potential noise from such traces. (2) We show that Internet miscreants are increasingly using potentially unwanted programs (PUPs) that rely on a surprisingly stable DNS and IP infrastructure. This indicates that the security community is in need of better protections against such threats, and network policies may provide a solid foundation for such protections. (3) Finally, we see that, for the vast majority of malware samples, network traffic provides the earliest indicator of infection—several weeks and often months before the malware sample is discovered. Therefore, network defenders should rely on automated malware analysis to extract indicators of compromise and not to build early detection systems.
BlueHat v17 || A Lustrum of Malware Network Communication: Evolution and Insights
1. A LUSTRUM OF MALWARE
NETWORK COMMUNICATION
EVOLUTION AND INSIGHTS
CHAZ LEVER, PhD CANDIDATE
2. WHY DO WE CARE ABOUT MALWARE?
What is malware?
• Quite simply it is malicious software (e.g., viruses, spyware,
ransomware, and adware).
Why do we care?
• Used for illicit activities that affect individuals, enterprises, and
even governments.
• Reverse engineering malware is the foundation upon which
numerous security defenses are based.
3. WHAT IS MALWARE ANALYSIS?
What is malware analysis?
• Process of studying the functionality and potential impact of
malware samples.
• Static analysis examines malware without executing it.
• Dynamic analysis examines malware by running it in a controlled
sandbox.
Why is it important?
• This is how indicators of compromise (IOCs) and other
information are derived from actual malware samples.
4. MORE MALWARE, MORE PROBLEMS?
• Cyber attacks are on the rise.
• Malware has been at the center
of a number of these attacks.
• Despite access to more malware
samples than ever, malware-based
security products did not
prevent threats.
7. A LUSTRUM OF MALWARE
What did we do?
• Study the network signal extracted from malware over a
half decade.
What are we trying to understand?
• Is malware effective for use with early warning systems?
• What are limitations of systems that rely on malware
samples for defense?
9. MALWARE CLASSIFICATION
What’s the goal?
• Cluster AV labels from VirusTotal based on family.
• Link each family with both a type and its queried e2LDs.
• Will use this information to provide extra context in later analysis.
What did we do?
• Modify AVClass to spit out a type (i.e., PUP or malware) for each
sample.
• Ran over our dataset of 23.9M VirusTotal reports.
11. CLASSIFICATION RESULTS
• There are more malware families, but PUP families tend to
have more samples per family.
Figure panels: Top Families by Sample; Top Families by e2LD
• Malware families tend to have more e2LDs per sample,
indicating greater domain polymorphism.
12. CLEANING UP DATASETS
Invalid Domains
• Remove NX Domains to reduce the effects of Domain Generation Algorithms (DGA)
• Reduction from 6.8M to 1.31M e2LDs
Benign Domains
• Remove popular domains from Alexa
• Remove known content delivery networks (CDN)
• Manually whitelist remaining domains
• Reduction from 1.31M to 1.29M e2LDs
Spam Domains
• Remove resolutions from binaries with lots of MX lookups
• Remove resolutions with mail related keywords (e.g., mail, smtp, imap)
• Reduction from 1.29M to 329,348 e2LDs
Reverse Zone Delegations
• Remove reverse delegations, which often result from system level processes and introduce lots of
noise.
• Reduction from 329,348 to 327,514 e2LDs
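The four filters above can be sketched as a single pass over sandbox DNS records. This is a minimal illustration, not the pipeline used in the study: the `Resolution` record layout, the `benign_e2lds` set (standing in for the Alexa, CDN, and manual whitelists), and the `mx_threshold` cutoff for "lots of MX lookups" are all assumptions made for the example.

```python
from dataclasses import dataclass

MAIL_KEYWORDS = ("mail", "smtp", "imap")  # keywords named on the slide


@dataclass(frozen=True)
class Resolution:
    """One DNS lookup observed in a sandbox trace (hypothetical layout)."""
    sample: str  # identifier of the sample that issued the query
    fqdn: str    # queried name
    e2ld: str    # effective second-level domain of fqdn
    rcode: str   # e.g. "NOERROR" or "NXDOMAIN"
    qtype: str   # e.g. "A", "MX", "PTR"


def clean(resolutions, benign_e2lds, mx_threshold=100):
    """Apply the four filters and return the surviving e2LDs.

    mx_threshold is an assumed cutoff; the slides do not state one.
    """
    # Count MX lookups per sample so spam-sending binaries can be dropped.
    mx_counts = {}
    for r in resolutions:
        if r.qtype == "MX":
            mx_counts[r.sample] = mx_counts.get(r.sample, 0) + 1
    spammy = {s for s, n in mx_counts.items() if n >= mx_threshold}

    kept = set()
    for r in resolutions:
        name = r.fqdn.lower().rstrip(".")
        if r.rcode == "NXDOMAIN":                  # invalid domains
            continue
        if r.e2ld in benign_e2lds:                 # benign domains
            continue
        if r.sample in spammy:                     # spam: MX-heavy binaries
            continue
        if any(k in name for k in MAIL_KEYWORDS):  # spam: mail keywords
            continue
        if name.endswith((".in-addr.arpa", ".ip6.arpa")):  # reverse zones
            continue
        kept.add(r.e2ld)
    return kept
```

Plain substring matching on mail keywords is deliberately crude and will also drop some names that merely contain "mail"; a stricter label-level match may be preferable in practice.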
13. DOMAIN POLYMORPHISM
• Most malware samples resolve fewer than 10 unique fully qualified
domain names (FQDNs).
• Most registered domains are queried by only a single, unique
malware sample.
• Evasion appears to happen on the registered domain.
Blacklisting domains may do little to prevent future
communication from new samples.
subdomain.example.com
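As a rough sketch of how the per-domain observation above could be measured, the following groups queried names by registered domain and reports how many registered domains were queried by exactly one sample. The helper names are hypothetical, and `naive_e2ld` simply keeps the last two labels; a real implementation would need the Public Suffix List to handle suffixes such as co.uk.

```python
from collections import defaultdict


def naive_e2ld(fqdn):
    """Return the last two labels of a name.

    Simplification: suffixes with more than one label (e.g. co.uk) are
    not handled; that requires the Public Suffix List.
    """
    labels = fqdn.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def samples_per_e2ld(queries):
    """Map each e2LD to the set of samples that queried a name under it.

    queries is an iterable of (sample_id, fqdn) pairs.
    """
    by_e2ld = defaultdict(set)
    for sample_id, fqdn in queries:
        by_e2ld[naive_e2ld(fqdn)].add(sample_id)
    return dict(by_e2ld)


def single_sample_share(by_e2ld):
    """Fraction of e2LDs that were queried by exactly one sample."""
    if not by_e2ld:
        return 0.0
    singles = sum(1 for samples in by_e2ld.values() if len(samples) == 1)
    return singles / len(by_e2ld)
```

For example, `naive_e2ld("subdomain.example.com")` yields `"example.com"`, so every child label under one registered domain lands in the same group.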
14. MALWARE QUERYING DYNAMIC DNS
• Evasion happens on the child label.
• Queried by 8.6M (32%) distinct samples in our dataset.
Description: The Top 100 most popular Dynamic DNS domains queried by malware
samples.
15. MALWARE QUERYING CDNS
• Most popular CDNs are the usual suspects.
• Malware communication is hiding in plain sight.
Description: Complete list of all known CDN domains queried by malware samples in
our dataset.
16. MALWARE QUERYING DGA DOMAINS
• Over 12.5M (46%) of
malware samples contained
at least one NX domain.
• Before filtering, we found
that 3M (44%) of all
domains were in DGArchive.
• After filtering, we found that
55,396 (17%) of filtered
domains were in DGArchive.
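The two DGArchive percentages can be sanity-checked with a few lines of arithmetic. This assumes the denominators are the e2LD counts reported for the dataset before and after cleaning (6.8M and 327,514); the slides do not state the denominators explicitly, and the 3M and 6.8M figures are themselves rounded.

```python
def share(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# Assumed denominators: e2LD counts before and after dataset cleaning.
before = share(3_000_000, 6_800_000)  # DGArchive hits among unfiltered e2LDs
after = share(55_396, 327_514)        # DGArchive hits among filtered e2LDs
print(before, after)                  # prints: 44 17
```

Under that assumption both reported percentages are reproduced, which suggests that removing NX domains strips out most, though not all, DGA-generated names.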
17. MALWARE QUERYING SPAM DOMAINS
• Most spam related malware
samples queried hundreds or
thousands of MX domains.
• Most popular spam related
sample (i.e., MyDoom) is
over a decade old.
18. AN INCONVENIENT TRUTH
(a) pDNS (b) PBL (c) Expired Domains
Description: Time difference between when a domain was first seen in
passive DNS, public blacklists, or an expired domain list and when it
was first seen through dynamic malware analysis.
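A minimal sketch of the quantity plotted in these panels follows. The function names are hypothetical, and the sign convention (negative when the other source saw the domain first) is an assumption chosen so that earlier sightings fall on the left of the axis.

```python
from datetime import date


def delta_days(first_seen_source, first_seen_sandbox):
    """Days between first sighting in a source and in the sandbox.

    Negative: the source (pDNS, PBL, or expired list) saw the domain
    before dynamic analysis did. Positive: it saw the domain afterwards.
    """
    return (first_seen_source - first_seen_sandbox).days


def share_seen_earlier(pairs):
    """Fraction of (source_date, sandbox_date) pairs where the source led."""
    deltas = [delta_days(src, sbx) for src, sbx in pairs]
    if not deltas:
        return 0.0
    return sum(1 for d in deltas if d < 0) / len(deltas)
```

For instance, a domain first resolved in passive DNS on 2015-01-01 and first seen in a sandbox run on 2015-01-31 gives a delta of -30 days.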
19. LIFETIME OF DOMAINS
(a) Malware (b) PUP (c) Unknown
Description: Joint distribution of domain lifetime and resolution frequency
observed in passive DNS for PUP, Malware, and Unclassified domains.
Notice similarities
22. KEY TAKE-AWAYS
• Waiting for malware to be discovered results in long
windows of vulnerability and potentially limited efficacy.
• Network defenses have the potential to identify threats
before the malware sample is discovered.
• Malware analysis is still extremely useful, but it’s
important to understand the limitations.
Malware is simply malicious software. Lots of different types but “malware” is the general umbrella under which they fall.
Malware is used to facilitate illicit activity on the Internet—affecting individuals, enterprises, and even nation states.
Give some examples of different types of malware abuse (Mirai botnet, banking trojans, etc.)
Defenses frequently rely on malware analysis and therefore require the malware sample for future protection.
AVs will build signatures from malware analysis.
IDSes and blacklists will rely on the network signal extracted from malware analysis.
- The world isn’t on fire, but there are definitely some potential pitfalls.
To our knowledge, largest such study performed.
Brought together a number of different datasets to better understand network communication from malware.
- Largest classification effort to date
LOTS of malware and increasing year over year.
Shows an increase in PUP samples over time—even overtaking number of malware samples in 2015.
Previous work by Kotzias et al. shows the same trend but on much smaller datasets.
Work by Thomas et al. showed that Google Safe Browsing generates 3X as many detections for PUP as for malware.
3,834 families with 10 samples are over 90% of all samples.
3,165 are malware
669 are PUP
PUP has an average of 16K samples per family.
Malware has an average of only 3.5K per family.
We identified 36.5K malware e2LDs and 9.1K PUP e2LDs.
Found 718 e2LDs that account for 51,350 FQDNs
The most popular dynamic DNS domain is dnsd.me
(owned by the dynamic DNS provider DNSdynamic [1]),
queried by 216,221 unique MD5s.
The service is not only free, but also offers unlimited registrations and an API for account management, making it very attractive for malware authors.
The most popular CDNs did *not* appear to be lesser known, shady organizations!
The top five most queried CDN domains include akamai.net, edgesuite.net, cloudfront.net, netdna-cdn.com, and akadns.net.
The akamai.net domain alone:
is queried by 2,183,352 distinct malware samples
has 1,492 unique child labels under it.
DGAs are a prevalent form of behavior seen across many different malware samples.
Account for a large number of domains from malware analysis
Illustrates the challenges of building blacklists from malware feeds
- Provides motivation for why we treated these differently from other samples.
PBL
Only 30% of domains were added to PBLs before being observed in malware analysis.
20% were reported with a delay of over 500 days
Result is consistent with previous work by Kuhrer et al., where domains were seen in DNS on average 384 days before appearing in PBLs.
Reputation systems have also shown the ability to discover threats faster than PBLs
pDNS
Long tail on left part of graph can be partially explained by malware relying on benign infrastructure such as dynamic DNS and CDN providers.
Potentially long setup phase.
Expired Domains
Has a pronounced effect on the right side of the graph (i.e., more domains seen after discovering malware sample)
Summary
Blacklists built from dynamic malware analysis will still be unaware of potential threats for several weeks or even months.
Malware/Unknown
Three separate hotspots: bottom left, top right, bottom right
Malware and unknown appear to show the same behavior (i.e., unknown is likely not PUP)
PUP
prevalence of PUP domains over last 2-3 years justifies [1000, 1000] bounding of the joint distribution
Organizations failing to block PUP domains
End-point security engines that do not manage to remediate PUP infections.
Summarizing
all three types of domains frequently have long domain lifetimes
many of those domains are frequently looked up.
most domains were only resolved by a single sample in Section V-A1
this suggests that many samples remain active on the Internet for extended periods of time.