(Kind of) Big Data in InfoSec. A presentation around the usage of statistical analysis of large data sets, predictive models, and machine learning, all focusing on an infosec use case.
4. Bona fides
• Security team at Enterprise Integration
• Bachelor’s Degree in Computer Science
• US Air Force – Communications Officer
• Multiple Industry Certifications
• 11 years of experience in the information security industry (yikes!)
• Supporting 26 customers – well over 40K users and 75K systems
• Do a lot of what we are going to discuss every day!
• Would probably do this even if I wasn’t paid….
BigDatainInfoSec-BenFinke-@benfinke
6. A brief primer – Information Security
• Security is unique to every person and every business
• Confidentiality
• Integrity
• Availability
• Compliance versus Defense
• What is a security event?
• Prevention and Detection
• What is the real target for attackers?
7. So are attackers the big concern?
Yes and No
Any downtime or loss of data can be a security event. However, the vast
majority of the problems we will be focusing on today will involve a third party
who wants something we have.
8. Who are these attackers anyways?
Chart: Difficulty to Defend vs. Capabilities and Resources
• Script Kiddie
• Militia/Terrorist Groups
• Hacktivists
• Corporate Espionage
• Organized Crime
• Nation States
9. Botnets
• Organized crime groups run large and complex ecosystems. One part of these is
the creation and maintenance of huge botnets: networks of compromised computers
that can be leveraged as part of attacks or scams.
http://www.spamhaus.org/news/article/720/spamhaus-botnet-summary-2014
Waledac Botnet - http://blogs.microsoft.com/blog/2010/02/24/cracking-down-on-botnets/
10. Stuxnet and its legacy
Stuxnet targeted very specific Programmable Logic Controllers (PLCs).
It was one of the most complex pieces of malware ever written.
There have been similar follow-on variants, but Stuxnet was the first to be
viewed by most researchers as a weapon created by a nation state.
Commercial cybercrime operators are undoubtedly studying this incredible
engineering to improve their own products….
http://spectrum.ieee.org/telecom/security/the-real-story-of-stuxnet
11. We haven’t been gaining ground
Verizon 2015 Data Breach Investigations Report
http://www.verizonenterprise.com/DBIR/2015/
12. But how could this be?
• How many Billions of $ are spent on security every year?
Source: http://www.wsj.com/articles/financial-firms-bolster-cybersecurity-budgets-1416182536
13. So it’s not $... What’s happening?
• We’ve lost the defender’s advantage..
– Organizations don’t know the terrain
– Maintaining operations in a secure state takes work
– We’ve been betrayed by our own information systems….
14. The ugly truth about modern IT systems
“Unfortunately, modern computing and communications technologies, for all their benefits,
are also notoriously vulnerable to attack by criminals and hostile state actors.”…..
“It is a regrettable (and yet time-tested) paradox that our digital systems have largely become
more vulnerable over time, even as almost every other aspect of the technology has (often
wildly) improved”…..
“Modern digital systems are so vulnerable for a simple reason: computer science does not yet
know how to build complex, large-scale software that has reliably correct behavior. This
problem has been known, and has been a central focus of computing research, since the dawn
of programmable computing. As new technology allows us to build larger and more complex
systems (and to connect them together over the Internet), the problem of software correctness
becomes exponentially more difficult.”
Matt Blaze
April 2015
http://www.crypto.com/papers/governmentreform-blaze2015.pdf
15. Anyone recognize these?
We’ve started seeing the designer vulnerability – with great marketing and
all!
These vulnerabilities take advantage of underlying software that so many
other systems are built on, causing panic and confusion about what is actually
vulnerable. Worse, many “appliances” leverage vulnerable versions, and
patches and upgrades can take months….
Heartbleed, Shellshock, POODLE, GHOST, VENOM
16. But wait, there’s more!
I don’t really know how to put this, so I’m just going to put it….
Verizon 2015 Data Breach Investigations Report
http://www.verizonenterprise.com/DBIR/2015/
17. Bruce Schneier
• Pioneer in information security…
“I am regularly asked what the average Internet user can do to ensure his
security. My first answer is usually ‘Nothing, you’re screwed.’”
18. Bruce’s quote notwithstanding…
We have a situation where we know the building blocks of our IT systems will
continue to go through the research-vuln-patch-fix cycle.
What this really means is that pure prevention is impossible. We simply
cannot guarantee that our IT systems will never be compromised.
We need to focus on developing resilient systems where detection and
correction are enhanced.
And so we began collecting logs….
19. A lot of logs…
20. Big Data in Security
• Logs from all kinds of different systems
– Authentication Logs
– Firewall Activity
– Network Devices
– Windows Event Logs
– Linux Logs
– Web Server Logs
– Anti-Malware Logs
– Web Proxy Logs
– Email Security Logs
– Web Application Firewall Logs
– Database activity
– Cloud Services*
• Other data we can apply….
*Notoriously difficult and not usually timely
Logs
Netflow
Network Security Monitoring
Derived Data
Security Testing Data
Context
21. Context
Information about the environment that a
human being would know or infer.
• Critical Systems List
• Admin-level accounts
• Location (Of person and device)
• Hardware Inventory
• Software Inventory
• Internet Facing?
• Open Tickets
• Change Controls
• System history
• Network Location
22. What can context do?
• Firewall log event:
May 22 14:02:51 172.21.250.1 %ASA-6-302013: Built inbound TCP connection 237062557 for
outside:58.71.107.127/44975 (58.71.107.127/44975) to dmz:192.168.250.130/443 (74.129.196.130/443)
• Context:
Source IP Country – China
Reputation Score – Reported Spam and Web Login Brute Forcing
DShield Listing – Active Attacker list
Associated IOCs - ….
Previous Communications …
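A minimal sketch of that enrichment step in Python. The regular expression is written against the single ASA message shown above and is not a general ASA parser. The CONTEXT table and the enrich function are hypothetical stand-ins for whatever geolocation and reputation sources are actually in use.

```python
import re

# Matches the "Built ... TCP connection" (302013) message format shown above.
ASA_BUILT = re.compile(
    r"%ASA-6-302013: Built (?P<direction>\w+) TCP connection (?P<conn_id>\d+) for "
    r"(?P<src_if>\w+):(?P<src_ip>[\d.]+)/(?P<src_port>\d+) \([\d.]+/\d+\) to "
    r"(?P<dst_if>\w+):(?P<dst_ip>[\d.]+)/(?P<dst_port>\d+) \([\d.]+/\d+\)"
)

# Hypothetical context store keyed by source IP (values taken from the slide).
CONTEXT = {
    "58.71.107.127": {
        "country": "China",
        "reputation": "Reported Spam and Web Login Brute Forcing",
        "dshield": "Active Attacker list",
    }
}


def enrich(line):
    """Parse one firewall log line and attach any known context for the source IP."""
    match = ASA_BUILT.search(line)
    if match is None:
        return None  # not a 302013 message in the expected shape
    event = match.groupdict()
    event["context"] = CONTEXT.get(event["src_ip"], {})
    return event


LOG_LINE = (
    "May 22 14:02:51 172.21.250.1 %ASA-6-302013: Built inbound TCP connection 237062557 for "
    "outside:58.71.107.127/44975 (58.71.107.127/44975) to dmz:192.168.250.130/443 (74.129.196.130/443)"
)

event = enrich(LOG_LINE)
print(event["src_ip"], event["dst_ip"], event["context"].get("country"))
```

The point is that the analyst sees the log event and the context together, instead of looking each one up by hand.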
23. Security Testing Data
Incorporate the information we generate during
security testing efforts.
• Vulnerability scans
• Web Application security assessments
– What kinds of requests would indicate attack
activity?
• Bridge between network segments?
– Attackers look to pivot to gain access
• Identified services and ports
– Something new show up?
24. Security Testing Data
• Firewall log event:
May 22 14:02:51 172.21.250.1 %ASA-6-302013: Built inbound TCP connection 237062557 for
outside:58.71.107.127/44975 (58.71.107.127/44975) to dmz:192.168.250.130/443 (74.129.196.130/443)
• Vuln Testing Details:
Target server is Windows 2012R2 running IIS
Critical Apps – Yes (SharePoint)
Vulnerability Status – 0 critical, 0 highs, 1 medium, 2 low
Behind Web Application Firewall – Yes
Part of HA – Yes
Mapped DB instances - …….
25. Derived Data
Information about a system that needs to be
generated by a script or action.
• Netstat
– What is this system talking to?
• Running processes
– Everything we expect to see and nothing more?
• Logged On Users
– Active Window and Idle time
• State of defensive components
– Status of AV and HIPS?
26. Network Security Monitoring
Using purpose built platforms to analyze
everything passing by on the network.
• Passive Endpoint Detection
• SNMP and Syslog activity
• Encrypted traffic analysis
• PKI Certificates in use
• Traffic matching IDS signatures
• Packet Captures
27. Netflow
Summarizing all network communications
between systems.
• Allows lengthy retention
• Easy baselining of network activity
• Capture utilization statistics
• Identify new traffic patterns
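As a rough illustration of baselining and spotting new traffic patterns, the Python sketch below treats each flow as a (source, destination, destination port) tuple and reports flows that never appeared in the baseline. The tuple layout, the new_flows function, and the sample addresses are assumptions for the example, not the export format of any particular Netflow collector.

```python
def new_flows(baseline, observed):
    """Return observed flows that are absent from the baseline, in first-seen order."""
    known = set(baseline)
    fresh = []
    for flow in observed:
        if flow not in known:
            fresh.append(flow)
            known.add(flow)  # report each new flow only once
    return fresh


# Hypothetical flow summaries: (src_ip, dst_ip, dst_port)
BASELINE = [
    ("10.0.0.5", "10.0.1.20", 443),
    ("10.0.0.5", "10.0.1.30", 1433),
]
TODAY = [
    ("10.0.0.5", "10.0.1.20", 443),
    ("10.0.0.5", "198.51.100.7", 22),
    ("10.0.0.5", "198.51.100.7", 22),
]

for flow in new_flows(BASELINE, TODAY):
    print("new traffic pattern:", flow)
```

Because Netflow records are small summaries, the baseline can cover a long retention window without much storage.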
28. A good start..
Hooray for logs! We certainly have a lot of data (at least we thought we did)
Average-size network (~1,000 users) = 100 GB/day
It quickly became evident that a new practice was necessary for information
security teams – what are we going to do with all of this data?
29. Big Data Platforms in InfoSec
• ELK – Elasticsearch, Logstash, Kibana
• ELSA
• Graylog2
• Splunk
• Commercial SIEMs
• Lots of custom Hadoop lashups
• Only as good as the analysts who take care of them
• Lack of good tools to build predictive models
• Lack of good tools to build useful visualizations
• Lack of good integration into the overall defensive systems
30. Capability Example - Splunk
Collects structured or unstructured data
Field Extractions
Statistical Tools built into search interface
Visualization engine
Software that scales nicely on commodity hardware
Writing “apps” for Splunk
Connect to DBs and Hadoop Clusters (and some MongoDB goodness too!)
32. So what can we achieve with these?
• Correlation
• Context
• Log Shipping – the art of collecting logs from critical systems and delivering
them to the log management collectors, as close to real time as technically
feasible.
• Real-time logging = Real Time alerting, forensics, and statistics
• Batch Logging = Forensics and statistics, with relative alerting
• Oh yeah, and you may hear the phrase “Kill Chain”….
33. The Kill Chain
Source: SecureState - http://blog.securestate.com/open-source-threat-intelligence-sony-breach/
34. Example
User logs into the VPN
• IP Address = IP X
• User identity = UID
• User location
• Create a session for UID
Business Application reports login for
“admin” from IP X
• IP X is tied to VPN session
• We map true person to account
• Verify against user traffic profile
Intranet Site reports traffic from IP X
• IP X is tied to VPN session
• Unauth activity is attributed
• Transaction is added to user session
NSM reports activity by IP
• Active Device Fingerprinting
• Application identification
• Malicious activity detection
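The correlation in this example can be sketched in a few lines of Python: the VPN login creates a session keyed by IP, and later events from other systems are attributed to the person behind that IP. The Session and SessionTracker names, the field layout, and the sample values are all hypothetical; a real deployment would also need session expiry and handling for reassigned addresses.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    uid: str
    ip: str
    location: str
    events: list = field(default_factory=list)


class SessionTracker:
    """Ties activity reported by other systems back to the VPN user behind an IP."""

    def __init__(self):
        self._by_ip = {}

    def vpn_login(self, uid, ip, location):
        # The VPN log gives us IP X, the user identity, and the user location.
        session = Session(uid, ip, location)
        self._by_ip[ip] = session
        return session

    def session_for(self, ip):
        return self._by_ip.get(ip)

    def attribute(self, source, ip, detail):
        """Add an event to the session that owns this IP and return the true UID."""
        session = self._by_ip.get(ip)
        if session is None:
            return None  # no known session: worth a closer look on its own
        session.events.append((source, detail))
        return session.uid


tracker = SessionTracker()
tracker.vpn_login("uid-1001", "10.8.0.15", "Jacksonville, FL")
print(tracker.attribute("business-app", "10.8.0.15", "login for admin"))
```

Here the shared "admin" login is mapped back to the individual who opened the VPN session.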
35. Remote Access and External Services
A quick word problem: If a user logs in from Jacksonville, FL at 9 AM and
Chicago, IL at 10:30 AM, is it possible this is the same actual person?
What about Jacksonville at 8 AM and Amsterdam at 11 AM (UTC)?
Using a haversine function we can tell the distance between two geolocations.
We can then use the distance and time difference to determine if a given
action is likely to actually belong to the correct person.
Chart this both by biggest deltas in distance and in required speed.
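A small Python sketch of the word problem. The haversine formula itself is standard; the city coordinates are approximate, the mean Earth radius of 6371 km is the usual convention, and the 900 km/h plausibility ceiling (roughly airliner cruise speed) is an assumption chosen for illustration. Timestamps are assumed to already be normalized to one time zone.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 900.0  # assumption: about the cruise speed of an airliner


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def required_speed_kmh(loc_a, loc_b, hours_apart):
    """Speed needed to be at both locations within the given time window."""
    if hours_apart <= 0:
        return float("inf")
    return haversine_km(*loc_a, *loc_b) / hours_apart


def is_plausible(loc_a, loc_b, hours_apart):
    return required_speed_kmh(loc_a, loc_b, hours_apart) <= MAX_PLAUSIBLE_KMH


# Approximate coordinates (latitude, longitude)
JACKSONVILLE = (30.33, -81.66)
CHICAGO = (41.88, -87.63)
AMSTERDAM = (52.37, 4.90)

for name, dest, hours in [("Chicago", CHICAGO, 1.5), ("Amsterdam", AMSTERDAM, 3.0)]:
    speed = required_speed_kmh(JACKSONVILLE, dest, hours)
    print(f"Jacksonville -> {name}: {speed:.0f} km/h required")
```

Sorting login pairs by required speed floats the impossible ones to the top; sorting by raw distance shows the biggest jumps even when the timing technically works.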
36. Example 2
User logs into local network
• IP Address = IP X
• User identity = UID
• Device in use
• Create a session for UID
Business Application reports login for
“admin” from IP X
• IP X is tied to local session
• We map true person to account
• Verify against user traffic profile
Intranet Site reports traffic from IP X
• IP X is tied to local session
• Unauth activity is attributed
• Transaction is added to user session
Wireless network activity
• Auth network by UID
• Tie multiple devices to UID
39. So how well is that working?
SANS 2012 Survey - http://www.sans.org/reading-room/whitepapers/analyst/eighth-annual-2012-log-event-management-survey-results-sorting-noise-35230
40. The birth of Threat Intel
These tools now enable us to find and start blocking attack activity
Attack
Logs Sent to SIEM
Analyst sees attack,
enables block via
defenses
41. This happens all day, every day
Blue teams generate tons of data on their own about attackers
IP Addresses
Domain Names
Email Subject Lines
Malware Behavior
The natural question: How can we get access to all of this data that others are
collecting, and share what we see?
43. Threat Intelligence!
Because of course we want to be smart about it!
Various formats and protocols emerge to share this info
44. Not to mention commercial offerings
45. Surprisingly, it’s not perfect!
“Why did this domain get listed as malicious again?”
“This list has over 2 million IP addresses in it!”
“So that breach we just had…. None of those IPs or
domains were in our threat lists…”
“We added that block list to the firewall, but now the
config file is bigger than the device can handle…”
46. Problems
• The only place to put all this stuff is in the SIEM
• Almost everything is entirely reactive (AV Signatures)
• Threat “Intel” can create lots of noise for the humans
• Threat Intel sources are (almost always) very expensive, even for large
companies
• Loss of context for why a thing is bad
• False Positives and Botnets (your Mom’s PC, probably)
• Threat Intel sources suffer from a numbers problem…
47. What can we do?
48. MLSec Project
• Machine Learning Security Project
• Provides research and tools to help organizations understand how effective
this threat intel is, and how they can leverage machine learning and
predictive models into their information security operations.
• http://www2.mlsecproject.org/
49. MLSec Projects
• Combine
– Python program to harvest intel feeds from various sources
• SecRepo
– Repo of data samples to assist during development and testing of security
integrations with machine learning and predictive models….
• TIQ Test
– Statistical comparison of threat intel data – provides visual output!
• Thanks to Alex for all the help!
• TIQ Test was featured in the 2015 Verizon DBIR. How did the threat intel feeds do?
50. Lots of overlap, right?
• Nope. Hardly at all.
• 97% of intel was unique
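One simple way to measure overlap for your own feeds is to count how many indicators appear in exactly one source. The Python sketch below does only that; it is an illustration of the idea, not the method TIQ Test or the DBIR used, and the feed names and addresses are made up (documentation-range IPs).

```python
from collections import Counter


def unique_fraction(feeds):
    """Fraction of distinct indicators that appear in exactly one feed."""
    counts = Counter()
    for indicators in feeds.values():
        counts.update(set(indicators))  # de-duplicate within a single feed
    if not counts:
        return 0.0
    only_once = sum(1 for n in counts.values() if n == 1)
    return only_once / len(counts)


# Hypothetical feeds
FEEDS = {
    "feed_a": ["192.0.2.1", "192.0.2.2"],
    "feed_b": ["192.0.2.2", "198.51.100.9"],
}

print(f"{unique_fraction(FEEDS):.0%} of indicators are unique to one feed")
```

A high unique fraction means each additional feed is adding mostly new indicators, which also means no single feed is close to complete.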
51. Blue Team Nirvana
• Human analysts training an army of machine learning robots
• Scale is met by the blue team robots
• Humans do the creative stuff
• Real-time sharing of threat indicators for later use (context)
• Automating reactions to detected threats
• Distributed early warning systems
– Honeypots
– Sandboxes
– Network Security Monitoring
The end goal is to have machines handle all events and research, and
only present data to humans to have a decision made.
Over time the machines learn and act just like the human analysts.
Free the humans to do what humans do best!
52. Machine Learning/Predictive Models
• Behavior Anomalies
– Has this user ever logged into this application before?
• Network Traffic Anomaly Detection
– “Did this ever talk to that before?” and “Does traffic volume from each system look
right?”
• Incident Response Automation
– Can this machine be reliably cleaned by our tools and techniques?
• Obvious Attack Blocking
– That HTTP request looks like an RFI attack against PHP, we run .NET – Block
• Reviewing possible security events
– Float the really interesting stuff up to the humans
• But that’s sort of obvious stuff that lots of folks are trying (which will be
awesome!!)
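The "does traffic volume look right?" question can be sketched with nothing more than a z-score against each system's own history. This is a deliberately simple stand-in for a real anomaly model; the threshold of 3 standard deviations and the sample byte counts are assumptions for the example.

```python
from statistics import mean, stdev

Z_THRESHOLD = 3.0  # assumption: flag anything more than 3 standard deviations out


def volume_zscore(history, today):
    """How many standard deviations today's volume sits from the historical mean."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs at least 2 points
    if sigma == 0:
        return 0.0 if today == mu else float("inf")
    return (today - mu) / sigma


def looks_anomalous(history, today):
    return abs(volume_zscore(history, today)) > Z_THRESHOLD


# Hypothetical daily outbound volume for one system, in MB
HISTORY_MB = [100, 110, 90, 105, 95]

print(looks_anomalous(HISTORY_MB, 102))
print(looks_anomalous(HISTORY_MB, 500))
```

Only the systems that trip the threshold need to be floated up to a human.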
53. Is that website likely to be hacked?
https://www.usenix.org/system/files/conference/usenixsecurity14/sec14-paper-soska.pdf
54. Is that software vulnerable?
VDiscover – Improving binary software vulnerability detection through ML
http://www.vdiscover.org/
55. Identify Users at Risk
• We’ve been developing a scoring system that ranks the most at-risk users.
• Considers dozens of metrics, including:
• Email Activity (inbound # of domains, outbound, etc.)
• Web Activity
• Authentication Activity
• Incident Tickets
• Phishing Exercise performance
• Endpoint systems used
• Access to critical or sensitive systems
• Wireless networks configured
• Findability (how much information is available online)
• Position within the organization
• And more!
Derived from historical review and modeling.
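To make the idea of a ranked risk score concrete, here is a toy weighted-sum version in Python. It is not the scoring system described above: the three metric names, the weights, and the assumption that each metric is already normalized to the range 0 to 1 are all invented for illustration. In practice the weights would come from the historical review and modeling.

```python
# Hypothetical weights; real ones would be fitted from historical incident data.
WEIGHTS = {
    "phishing_failures": 0.4,
    "incident_tickets": 0.3,
    "critical_access": 0.3,
}


def risk_score(metrics, weights=WEIGHTS):
    """Weighted average of metrics, each clamped to the range 0..1."""
    total = sum(weights.values())
    score = 0.0
    for name, weight in weights.items():
        value = min(max(metrics.get(name, 0.0), 0.0), 1.0)
        score += weight * value
    return score / total


def rank_users(users):
    """Return (uid, score) pairs, most at-risk first."""
    scored = [(uid, risk_score(m)) for uid, m in users.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical users with pre-normalized metrics
USERS = {
    "uid-1001": {"phishing_failures": 0.9, "incident_tickets": 0.5, "critical_access": 1.0},
    "uid-1002": {"phishing_failures": 0.1, "incident_tickets": 0.0, "critical_access": 0.2},
}

for uid, score in rank_users(USERS):
    print(uid, round(score, 2))
```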
56. Predictive Models for Pentesters
Security tests are really useful for simulating a specific problem, especially an
attacker attempting to gain access to critical systems.
Usually these tests function under severe time and resource constraints.
Let’s use machine learning and predictive modeling to make our assessments
more effective!
Considers factors like
• Pivot Capabilities
• Vulnerability Likelihood
• System role
• Used by admins
• Users most vulnerable to SE attacks
• Discovering relationships between systems and components
57. Novel approaches to applying ML/PM
• “Just in Time” Context for events (Team Cymru, Internet research, etc.)
• Improving Security Testing outcomes (PM for Pentesters!!)
• Building a “Phish” score for customers
• Using customer metadata as a signature
• Using Machine Learning to score the security of a “gold image”
• Building predictive models from early warning systems (honeypots)
• Using Predictive Models to block external sources based on ticket data
• Machine Learning to “shadow” an analyst and emulate (to scale!)
• Using Predictive Models to score system vulnerability levels
58. And lots more on the way…
Organizations that position themselves to utilize their existing log
management tools will be able to take advantage of the coming wave of
machine learning and predictive models. This will enable rapid sharing and
implementation of threat intelligence as well.
Without a good foundation, these tools will simply provide more noise and
work. While every organization is anxious to leverage these, you need to
answer these questions first:
• Do we have a complete inventory of all the devices on our networks?
• Do we know the security posture of those systems?
• Do we have a Single Point of Truth that we trust?
• Do we have the appropriate information from our critical systems collected
by our log management system?
• Do we have baseline profiles for our users and our critical applications?
• Do we have defined incident response procedures?
59. Thank you! Any questions?