The last five to ten years have seen massive advancements in open source Internet-wide mass-scan tooling, on-demand cloud computing, and high-speed Internet connectivity. This has led to a massive influx of different groups mass-scanning all four billion IP addresses in the IPv4 space on a constant basis. Information security researchers, cyber security companies, search engines, and criminals scan the Internet for both benign and nefarious reasons (the latter including the WannaCry ransomware and multiple MongoDB, ElasticSearch, and Memcached ransomware variants). It is increasingly difficult to differentiate between scan/attack traffic targeting your organization specifically and opportunistic mass-scan background radiation.
Grey Noise is a system that records and analyzes all the collective omnidirectional background noise of the Internet, performs enrichments and analytics, and makes the data available to researchers for free. Traffic is collected by a large network of geographically and logically diverse “listener” servers distributed around different data centers belonging to different cloud providers and ISPs around the world.
In this talk I will candidly discuss my motivations for developing the system; give a technical deep dive on the architecture, data pipeline, and analytics; share observations and analysis of the traffic collected by the system; and cover business impacts for network operators, pitfalls and lessons learned, and the vision for the system moving forward.
3. About Me
Andrew Morris
Background in offensive cyber stuff, security research
Previously:
* Endgame R&D
* Intrepidus (NCC Group)
* KCG (ManTech)
Twitter: @Andrew___Morris
4. Lots of people scan the Internet.
I built a system that collects all of the
Internet-wide scan traffic.
I analyze the data to find weird stuff.
I make that data available to researchers
for free via an API
6. Background
• Internet-wide mass scanning is easier than ever
• Open source tooling: Masscan, ZMap,
Unicornscan, etc.
• Cloud computing
• Instant servers
• Large number of recyclable IP addresses
• High throughput / faster global Internet
connections
7. What is Internet Mass
Scanning?
• “Mass Scanning” is scanning every single
routable IP address on the Internet for
something
• The IPv4 address space is 0.0.0.0 –
255.255.255.255
• Give or take a few blocks
• That’s 4.2 billion IP addresses
• Bandwidth-wise, roughly the same as uploading a 240
GB file
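The two figures on this slide can be sanity-checked against each other. The sketch below uses only the size of the IPv4 space and the 240 GB figure from the slide; the bytes-per-address value is derived from those two numbers and is not a measured packet size.

```python
# Back-of-the-envelope check of the numbers on the slide.
TOTAL_IPV4 = 2 ** 32          # every address from 0.0.0.0 to 255.255.255.255
SCAN_BYTES = 240 * 10 ** 9    # the "240 GB file" figure from the slide

# How many bytes per address the 240 GB figure implies.
bytes_per_address = SCAN_BYTES / TOTAL_IPV4

print(f"{TOTAL_IPV4:,} addresses")
print(f"{bytes_per_address:.1f} bytes per address")
```

The implied budget is a few dozen bytes per address, which is consistent with sending roughly one small probe packet to each one.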
8. What does this mean?
• Lots and lots of people scanning the Internet,
for lots of different things
• From millions of different IP addresses
• Benign: Shodan, Censys, Sonar, ShadowServer
• Malicious: SSH/Telnet worms (Mirai), IoT worms,
Conficker, etc.
• Internet-wide scanning is busier than ever
9. This creates a problem
When you see an IP scanning your network,
are they scanning you specifically or the
entire Internet?
When you see an IP attacking your network,
are they attacking you specifically or the
entire Internet?
10. Solution
• Collect all the omnidirectional Internet-wide
IPv4 scan/attack traffic
• Subtract those IPs/activity from your SIEM
• All the remaining activity is targeting you
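The subtraction step is a plain set difference. A minimal sketch, assuming both inputs are flat lists of source IP strings; the function name and the addresses (taken from the documentation ranges) are made up for illustration.

```python
def targeted_only(siem_source_ips, background_noise_ips):
    """Return source IPs from your own logs that are NOT known mass scanners."""
    return set(siem_source_ips) - set(background_noise_ips)


# Hypothetical data: what your SIEM saw vs. what the listener network saw.
siem_ips = ["198.51.100.7", "203.0.113.9", "192.0.2.44"]
noise_ips = ["203.0.113.9", "192.0.2.44", "192.0.2.200"]

remaining = targeted_only(siem_ips, noise_ips)
print(sorted(remaining))
```

Whatever survives the subtraction is the traffic worth a closer look first.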
11. But how?
• Stand up a large number of servers with no
business value in diverse data centers
• No business value means that ANY traffic that hits
them is, by definition, opportunistic
• Instrument these servers with extremely
aggressive logging and small microservices
• Stream the logs of the scan/attack traffic to a
central place
• Analyze the data and convert into a consumable
format
12. Barriers
• It is strategically cheaper to ask a question
of the Internet than it is to answer a given
question
• “How many computers are running version X of some
software?” is easy
• “How many computers are scanning for version X of
some software?” is hard
13. Byproducts
• Observe changes in Internet-scanning over time
• Opt out of omnidirectional scanning altogether
• Collect information on malware campaigns and
botnets
14. History
• Like three honeypots (2014)
• Animus v1 (2015)
• Bash and glue (ShmooCon 2015 “No Budget Threat
Intelligence”)
• Related work at a previous company (2015-2016)
• EPIPHANY (2016)
• THE DATA THAT HONEYPOTS COLLECT IS SHITTY THREAT
INTELLIGENCE
• IT’S LITERALLY THE OPPOSITE OF THREAT INTELLIGENCE
• IT’S ANTI THREAT INTELLIGENCE
• Animus GOES COMMERCIAL (2017)
• Turns out startups are hard
• Grey Noise (2018)
• I’m not going to stop until I die
• ???
• Become a monk
25. Collection: Data producers /
Services
• Ridiculously aggressive iptables rules
• Log all packets
• …on all ports
• …on all protocols
• SSH
• Telnet
• HTTP
• Others
26. Collection: Data producers /
Services
(Lessons)
• MISTAKE: Tune your iptables / p0f / sniffers / whatever to
ignore garbage / outbound traffic
• LESSON: Things will be spoofed (TCP, UDP, and ICMP)
• LESSON: Bang for your buck: iptables, HTTP, Telnet, SSH, and
p0f
31. Collection: Message Bus
(Lessons)
• MISTAKE: Google PubSub
• LESSON: Maintain state
• LESSON: Meta message envelope
• Time
• Provider
• Region
• Node UUID
• POSSIBLE: ZeroMQ, Kafka
• Streamd
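The meta message envelope from the lessons above could look something like this. It is a sketch, not the system's actual format: the class name, field names, and the `wrap` helper are hypothetical, and JSON is used only because it ships with Python.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass


@dataclass
class Envelope:
    """Metadata wrapped around every event before it goes on the bus."""
    timestamp: float   # "Time" on the slide
    provider: str      # which cloud provider / ISP the node lives in
    region: str        # provider region or data center
    node_uuid: str     # stable identifier for the listener node
    payload: dict      # the raw event from iptables, SSH, Telnet, ...


def wrap(payload, provider, region, node_uuid):
    """Attach the envelope fields to a raw event."""
    return Envelope(time.time(), provider, region, node_uuid, payload)


NODE_UUID = str(uuid.uuid4())   # generated once per node in this sketch
event = {"service": "telnet", "src_ip": "192.0.2.10"}   # made-up event
encoded = json.dumps(asdict(wrap(event, "example-cloud", "us-east", NODE_UUID)))
print(encoded)
```

Carrying these fields on every message means downstream analysis never has to guess which node, region, or provider an event came from.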
32. Collection: Log Forwarder
• I wrote my own
• Python + Pygtail / inotify / Watchdog
• Can also use something that’s already been
written
• Logstash
• Filebeat (Elastic)
• Rsyslog
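The core of a home-grown forwarder is remembering how far into each log file you have already read. The sketch below shows that idea with the standard library only; it does not use Pygtail, inotify, or Watchdog, it ignores log rotation, and `read_new_lines` is a hypothetical helper name.

```python
import os
import tempfile


def read_new_lines(path, offset):
    """Return (complete lines appended since `offset`, new offset)."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    # Forward only complete lines; a trailing partial line waits for next time.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end


# Demo against a temporary file standing in for a service log.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "service.log")
    with open(log_path, "w") as log:
        log.write("first event\n")
    batch_one, position = read_new_lines(log_path, 0)

    with open(log_path, "a") as log:
        log.write("second event\npartial")
    batch_two, position = read_new_lines(log_path, position)

print(batch_one, batch_two)
```

In a real forwarder the offset would be persisted to disk so that a restart neither drops nor resends events.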
35. Analysis: Cache / Database
• PostgreSQL
• N days of data, rotates
• Fast-ish
• Robust
• Dumpster
• Long term storage
• You’re going to fuck something up
• Retro load is your friend
36. Analysis: Cache / Database
• MISTAKE: Postgres is awesome but too slow for data this big
• MISTAKE: Google BigQuery is the shit but it gets expensive if you're
doing batch queries on a very short timeline
• LESSON: Postgres + Cassandra is the truth
39. Analysis: Enrichments
• We need:
• ASN
• rDNS
• Organization
• Country
• City
• Maxmind is expensive
• Neustar is expensive
• Ipinfo is CHEAP
• Harvesting it yourself is also CHEAP but requires a lot of effort
40. Analysis: Enrichments
(Lessons)
• MISTAKE: Collecting the data yourself is hard and inconsistent and
involves a lot of work
• LESSON: ARIN has an unauthenticated non rate-limited public API for
IP ownership
• LESSON: Enrichd
• LESSON: Cache rules everything around me
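The caching lesson can be sketched as a thin wrapper around whatever lookup you use. Everything here is illustrative: `CachedEnricher` and `fake_lookup` are hypothetical names, and the lookup returns canned values instead of calling a real enrichment source.

```python
class CachedEnricher:
    """Remember enrichment results so each IP is looked up only once."""

    def __init__(self, lookup):
        self._lookup = lookup   # any callable: ip -> dict of enrichment fields
        self._cache = {}
        self.misses = 0         # how many times the slow lookup actually ran

    def enrich(self, ip):
        if ip not in self._cache:
            self.misses += 1
            self._cache[ip] = self._lookup(ip)
        return self._cache[ip]


def fake_lookup(ip):
    # Stand-in for a real ASN / rDNS / geolocation query; values are made up.
    return {"ip": ip, "asn": "AS64500", "country": "ZZ"}


enricher = CachedEnricher(fake_lookup)
for ip in ["192.0.2.1", "192.0.2.1", "192.0.2.2", "192.0.2.1"]:
    enricher.enrich(ip)
print(enricher.misses)
```

Because the same scanners show up again and again, most lookups hit the cache. A production version would also expire entries, since IP ownership changes over time.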
44. Analysis: Analyticsd
• Service to analyze some time window of data
• E.g. past 4 days of data
• Catalogue:
• Actors
• Shodan
• Censys
• Sonar
• Activity
• Scanning for SSH
• Scanning for Telnet
• LESSON: YOU PROBABLY DON’T NEED REAL TIME ANALYTICS
• Batch analytics with small time frames
• This is why Postgres will often do the trick
• LESSON: Only pay attention to activity that has happened on more than one of your nodes
• LESSON: You need to know how many nodes are up collecting data at any point in time to
properly do a time-series analysis
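The last two lessons translate into two small helpers: keep only source IPs seen by more than one node, and normalize counts by how many nodes were up. This is a sketch; the function names, the `(src_ip, node_id)` event shape, and the sample values are assumptions.

```python
from collections import defaultdict


def seen_on_multiple_nodes(events, min_nodes=2):
    """Return source IPs observed by at least `min_nodes` distinct nodes."""
    nodes_by_ip = defaultdict(set)
    for src_ip, node_id in events:
        nodes_by_ip[src_ip].add(node_id)
    return {ip for ip, nodes in nodes_by_ip.items() if len(nodes) >= min_nodes}


def events_per_node(event_count, nodes_up):
    """Normalize a raw count by the number of nodes collecting at that time."""
    if nodes_up <= 0:
        raise ValueError("need at least one live node")
    return event_count / nodes_up


# Made-up events: (source IP, node that saw it).
events = [
    ("192.0.2.10", "node-a"),
    ("192.0.2.10", "node-b"),
    ("198.51.100.5", "node-a"),
    ("198.51.100.5", "node-a"),   # repeated hits on one node do not count
]
print(seen_on_multiple_nodes(events))
print(events_per_node(900, 90), events_per_node(600, 60))
```

Without the normalization, a day with fewer live nodes looks like a drop in scanning when the per-node rate has not changed at all.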
47. Consumption: API
• Web API
• Tell me about this IP address
• Tell me about this analytic
• GitHub
• Search “Grey Noise API”
• Github.com/Grey-Noise-Intelligence
48. Consumption: Bindings
• Bobby Filar: phyler/greynoise
• Tek: PyGreyNoise
• Bob Rudis: R bindings
• Some mystery Go bindings out there
49. Consumption: FRONT END
• Complete 100% credit to Casey Buto (github.com/cbuto)
• Point and click interface
• Hosted version at viz.greynoise.io
• EXPLORE THE DATA
52. OpSec (Operational Security)
• Hard to fingerprint (mostly custom services)
• Encrypt everything
• No names
• Ops domains
• Dockerize
• Shift infrastructure constantly
• Reduce the oracle surface
• IO is hard to opsec
• Minimum number / node thresholds
• Sleep delays
53. Cost
• AWS: 15 regions
• $4.75 per box
• Total: $71
• DigitalOcean: 11 regions
• $5 per box
• Total: $55
• Google: 36 regions
• $4.28 per box
• Total: $154
• Vultr: 15 regions
• $5 per box (they advertise $2.50 but they're never
available)
• Total: $75
• Linode: 9 regions
• $5 per box
• Total: $45
• Total (all providers): $400 per month
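The per-provider figures on this slide add up as follows. The numbers come straight from the slide; the only assumption is one box per region, which is what the per-provider totals imply.

```python
# (regions, monthly price per box in USD) as listed on the slide.
fleet = {
    "AWS": (15, 4.75),
    "DigitalOcean": (11, 5.00),
    "Google": (36, 4.28),
    "Vultr": (15, 5.00),
    "Linode": (9, 5.00),
}

# One box per region is assumed here.
subtotals = {name: regions * price for name, (regions, price) in fleet.items()}
total_boxes = sum(regions for regions, _ in fleet.values())
total_cost = sum(subtotals.values())

for name, cost in subtotals.items():
    print(f"{name}: ${cost:.2f}")
print(f"{total_boxes} boxes, ${total_cost:.2f} per month")
```

The sum lands within a dollar of the $400 per month quoted on the slide.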
54. Cost (notes)
• Notes:
• No Ops boxes in here (you need these)
• This is simply not enough to have complete coverage but it'll give you a good
start
• You can save money by buying extra IPs, but it complicates engineering
56. Analysis
• What am I collecting?
• Volume Summary
• Data Summary
• Actor Summary
• Benign
• Malicious
• Unknown???
• Malware Summary
• Hall of Shame (Malware-iest
regions of the Internet)
• WEIRD SHIT
• Misc Lessons
57. What am I collecting?
• Passive
• Iptables – Packets on ports
• P0f – passive OS fingerprint
• JA3 – SSL fingerprint (stick around!)
• Active
• HTTP
• SSH
• Telnet
• Experimental
• RDP
• SIP
• SMTP
• NTP
• TFTP
• DNS
58. Data Summary
• Iptables:
• I don’t have a good way to quantify this yet
• HTTP:
• Lots of “/”, spoofed user agents, search engines, people looking for
JBoss/WordPress/Tomcat/phpMyAdmin
• SSH + Telnet
• Bots. Default credential attempts. Nothing new here.
• P0f
• Lots of OS visibility
59. Volume Summary
• With the aforementioned numbers ($400 worth of servers):
• 1M – 2M iptables events per day
• 700k – 1M SSH logins per day
• 1M – 10M telnet logins per day
• 10K – 100K HTTP requests per day
• 100-200 messages per second through your queue
• ~60K IPs per day
• 1GB of raw data, msgpacked + compressed per day
64. Pretenders
• Machines advertising false client banners
• Mismatches between user agent, p0f OS fingerprint,
and JA3
• Is the browser hitting this HTTP server really
running Safari on a Linux kernel 3.1 box? Is it?
• Why? Idk
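A mismatch check of this kind can be sketched by comparing the OS a client claims in its User-Agent with the OS family in the passive fingerprint. The function names, the substring matching, and the sample strings are all illustrative assumptions; JA3 is left out to keep the example short.

```python
def claimed_os(user_agent):
    """Very rough OS family guess from a User-Agent string."""
    ua = user_agent.lower()
    if "windows" in ua:
        return "windows"
    if "mac os x" in ua or "macintosh" in ua:
        return "mac"
    if "linux" in ua:
        return "linux"
    return None


def observed_os(p0f_label):
    """Very rough OS family guess from a passive fingerprint label."""
    label = p0f_label.lower()
    for family in ("windows", "linux", "mac"):
        if family in label:
            return family
    return None


def is_pretender(user_agent, p0f_label):
    """True when both sources name an OS family and the families disagree."""
    claimed, observed = claimed_os(user_agent), observed_os(p0f_label)
    return claimed is not None and observed is not None and claimed != observed


# Illustrative inputs: Safari on macOS claimed, Linux TCP stack observed.
safari_ua = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) "
             "AppleWebKit/604.5.6 (KHTML, like Gecko) Version/11.0.3 Safari/604.5.6")
print(is_pretender(safari_ua, "Linux 3.1-3.10"))
```

A mismatch is a hint rather than proof, since a proxy or NAT device between the client and the listener can also produce one.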
65. Dangling DNS
• When you spin a bunch of IPs up and down, it’s
not uncommon to inherit an IP address from your
cloud provider that still has a domain pointing
to it.
• CDN.whatever123.acme.com
• This traffic is dirty; you don’t want it
66. “WORM FINDER”
• Sometimes when Grey Noise observes an IP
address scanning for a given TCP port, I’ll
turn around and check to see if that port is
open on the source machine.
• If the answer is yes, this can be a great
indicator of a worm
• Why else would a computer search for behavior
that it also exhibits?
• Average lifespan from start to finish is 4 days
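The check described above boils down to one TCP connection attempt back to the source. A minimal sketch using the standard library; `port_is_open` is a hypothetical helper, and the demo connects only to a throwaway listener on localhost.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Demo: a local listener stands in for a scanner that has the same port open.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
listener.listen(1)
scanned_port = listener.getsockname()[1]

open_while_listening = port_is_open("127.0.0.1", scanned_port)
listener.close()
open_after_close = port_is_open("127.0.0.1", scanned_port, timeout=1.0)
print(open_while_listening, open_after_close)
```

In the worm-finder setting, `host` would be the source IP of the scan and `port` the port it was scanning for.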
67. ZMap’s hardcoded ID parameter
• ZMap hardcodes the IP ID field of every packet it
creates to “54321”, making it trivial to
fingerprint
• Go to “github.com/zmap/zmap” and search / grep
the repository for “54321”
• Shoutout Oliver Gasser @ Technical University
of Munich
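Given packet logs that record the IP ID, the fingerprint is a one-line comparison. The sketch below assumes iptables-LOG-style lines made of `KEY=value` tokens; the helper names and sample lines are made up for illustration.

```python
ZMAP_IP_ID = 54321   # the hardcoded value mentioned on the slide


def parse_log_fields(line):
    """Split a log line into a dict of its KEY=value tokens."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:                  # skip bare flags such as SYN
            fields[key] = value
    return fields


def looks_like_zmap(line):
    """True when the logged IP ID equals ZMap's hardcoded value."""
    return parse_log_fields(line).get("ID") == str(ZMAP_IP_ID)


# Made-up sample lines in the assumed format.
zmap_line = ("IN=eth0 OUT= SRC=192.0.2.50 DST=198.51.100.1 LEN=40 "
             "TTL=240 ID=54321 PROTO=TCP SPT=45000 DPT=443 SYN")
other_line = ("IN=eth0 OUT= SRC=192.0.2.51 DST=198.51.100.1 LEN=60 "
              "TTL=52 ID=1234 PROTO=TCP SPT=51000 DPT=22 SYN")
print(looks_like_zmap(zmap_line), looks_like_zmap(other_line))
```

A match is a strong hint rather than proof, because any sender can end up with the same ID value by chance or on purpose.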
68. Still SO MANY WINDOWS WORMS
• LOADS of people blasting SMB traffic on TCP
port 445
• More and more RDP worms as well, but these
aren’t exploiting vulns, just guessing creds
• WinRM is next, in my opinion
69. People do weird stuff through
proxies
• Airline price scraping data (???)
• Also testing stolen credentials
• And probably credit card numbers
• News sites??? This is a huge rabbit hole…
70. Lots of robo calls probably
come from popped SIP boxes
• People try to make calls to India and Russia
through open VOIP servers
• Like, LOTS of them
• Tens of thousands per day
72. You can neuter/blow up worms by
replaying their own traffic back
to them
• A box is compromised with a Telnet worm
• The worm carries a built in wordlist
• The compromised box throws the same wordlist at
you
• You replay the wordlist back to the compromised
box
• Chances are, depending on the worm, one of
those credentials will work
74. What does the future hold?
• Version 1.1 API coming very soon
• Integrate with everything
• Badass machine learning opportunities
• Explore identifying anti-threat intelligence in
other areas
• Intranet traffic
• DMZ traffic
• Files on a filesystem
76. Conclusion
• The Internet is a noisy place
• Every packet has a story
• It’s possible to collect all of this background
noise
• If you want to explore the data, hit the API.
If the API doesn’t give you what you need,
email me or hit me up on Twitter
77. Acknowledgements
• Phil Maddox (twitter.com/foospidy)
• Bobby Filar (twitter.com/filar)
• Rich Seymour (twitter.com/rseymour)
• Casey Buto (github.com/cbuto)
• Bob Rudis (twitter.com/hrbrmstr)
• Tek (twitter.com/tenacioustek)
• Mickey Perre (twitter.com/MickeyPerre)
• Michel Oosterhof (twitter.com/micheloosterhof)