CERT Data Science in Cybersecurity Symposium
1. CERT DATA SCIENCE IN CYBERSECURITY SYMPOSIUM 2017-08-10
DATA SCIENCE IN CYBERSECURITY
PAST • PRESENT • FUTURE
It’s midday. We’re all rested from our midday repast, and I’d like us to pause for a minute, “take a knee” (in American football parlance), and look at where we’ve been,
where we are, and what I, at least, believe the future holds for us as we seek to use data to help detect and deter attackers in our defense of data, systems, networks and
“things”.
2. ABOUT://ME
20+ YEARS DEFENDING
FORTUNE 100 ORGANIZATIONS
WILEY AUTHOR
DBIR TEAM LEAD
RAPID7 CHIEF DATA SCIENTIST
@HRBRMSTR
HTTPS://RUD.IS/
2017 Q2
By Rebekah Brown, Threat Intelligence Lead, Rapid7, Inc.
Bob Rudis, Chief Security Data Scientist, Rapid7, Inc.
Jon Hart, Senior Security Researcher, Rapid7, Inc.
Dustin Myers, Threat Intelligence Researcher, Rapid7, Inc.
Vasudha Shivamoggi, Data Scientist, Rapid7, Inc.
Philip Thomsen, Data Science Intern, Rapid7, Inc.
August 2, 2017
RAPID7 QUARTERLY
THREAT REPORT
Intent, Capability, Opportunity, and the Threat Landscape
NATIONAL
EXPOSURE INDEX 2017
Rapid7 Labs | June 14, 2017
Bob Rudis, Chief Security Data Scientist, Rapid7
Tod Beardsley, Research Director, Rapid7
Jon Hart, Senior Security Researcher, Rapid7
I’ve been doing this “cyber” thing for a while… (rest of intro)
4. DATA SCIENCE
MATHS
STATISTICS
PATTERN RECOGNITION
MACHINE LEARNING (“AI”)
DEALING WITH UNCERTAINTY
COMPUTATION
ALGORITHMIC THINKING
PROGRAMMING
DATABASES
HIGH PERF. COMPUTING
One thing that’s important to mention here is the absence of a phrase we often hear alongside data science: “BIG DATA”. While we definitely have some “big data” problems in
cybersecurity (as it relates to data science tasks), we also have tons of small to medium-sized data.
5. DATA SCIENCE
MATHS
STATISTICS
PATTERN RECOGNITION
MACHINE LEARNING (“AI”)
DEALING WITH UNCERTAINTY
COMPUTATION
ALGORITHMIC THINKING
PROGRAMMING
DATABASES
HIGH PERF. COMPUTING
LIBRARY SCIENCE
INFORMATION LITERACY
determine the extent of information needed, access the needed information effectively and efficiently, evaluate information and its sources critically, incorporate selected
information into one’s knowledge base, use information effectively to accomplish a specific purpose, and understand the economic, legal, and social issues surrounding
the use of information, and access and use information ethically and legally.
6. DATA SCIENCE
MATHS
STATISTICS
PATTERN RECOGNITION
MACHINE LEARNING (“AI”)
DEALING WITH UNCERTAINTY
COMPUTATION
ALGORITHMIC THINKING
PROGRAMMING
DATABASES
HIGH PERF. COMPUTING
LIBRARY SCIENCE
INFORMATION LITERACY
COMMUNICATION
DATA VISUALIZATION
RELATING UNCERTAINTY
“REPORTING”
7. DATA SCIENCE
MATHS
STATISTICS
PATTERN RECOGNITION
MACHINE LEARNING (“AI”)
DEALING WITH UNCERTAINTY
COMPUTATION
ALGORITHMIC THINKING
PROGRAMMING
DATABASES
HIGH PERF. COMPUTING
LIBRARY SCIENCE
INFORMATION LITERACY
DOMAIN EXPERTISE
COMMUNICATION
DATA VISUALIZATION
RELATING UNCERTAINTY
“REPORTING”
In my definition, “data engineering” lies somewhere between computation, library science and domain expertise.
11. WHERE WE’VE BEEN
A BRIEF — MOSTLY ACCURATE — HISTORY OF DEFENSE & “DATA SCIENCE” IN CYBERSECURITY
We really can’t look ahead until we understand where we’ve been, so let’s start with a brief, mostly accurate, history of defending systems and “doing data science” in
cybersecurity. Some of this is “why are we even here?”, meaning why did we (as a field/profession) make the choices we made to force us into such a position today.
That’s important since that question is something we’re going to circle back to at the end.
12. The Internet (if you’ll allow me to brand the nascent ARPANET at that time the “internet”) was much smaller back in the day. A few handfuls of systems interconnected on
excruciatingly slow (by modern standards) communications lines. It’s somewhat amazing that researchers even got funding to do that work back in the day but we should
all be glad they did.
13. That tin-can-telephone is not much off the mark as the 1970 ARPANET was tiny, even compared to many home networks today.
14. ARPANET wasn’t around for that long before Bob Thomas wrote the “Creeper” worm, which traipsed across ARPANET systems printing a snarky challenge message on
the teletypes of those days. Ray Tomlinson “improved” on this technique and had Creeper replicate as well as move. He also wrote Reaper which traipsed across the
ARPANET and removed Creeper and its heftier cousin.
15. Now, you’d have thought that the advent of Creeper & Reaper on this fledgling bastion of hyper-connectivity would have ensured communication protocols and systems
design would include active defense against adversaries.
16. And, you’d be wrong. Creeper was what we’d call, today, a Proof of Concept (PoC), since it wasn’t malicious (per se). In fact, Creeper and Reaper were created
primarily out of curiosity: researchers pondered whether something was possible and then used software and communications channels in a way no one had foreseen.
17. ARPANET eventually gave way to what we’d call the modern Internet, with its TCP/IP foundations, in 1982. Three (3) whopping TLDs and a printed directory of what
was where. Cue “back in my day”…
18. “We didn’t focus
on how you could
wreck this system
intentionally.”
—Vint Cerf
It’s important to stop and remember that the research project that ultimately begat ARPANET which begat the Internet as we know it today was designed as a mitigation
to a threat model itself: nuclear war. This redundant networking foundation had connectivity as its primary goal. The focus was entirely on being able to communicate
during a global disaster. They weren’t really even threat modeling the network or systems at that time. The folks using it were pure academics or former academics
working for large companies.
19. “It’s not that we didn’t
think about security.
We knew there were
untrustworthy people
out there, and we
thought we could
exclude them.”
—David Clark
Even when evidence of destructive misuse became prevalent, the creators of this new digital world were caught off guard, as David Clark himself mused in this quote,
where he reminisced about the Morris Worm, which did massive damage in 1988. (More on that in a bit.)
20. SO…BY THE EARLY 1980s WE HAD THE:
- FIRST HARMLESS WORM
- FIRST SPAM
- FIRST “HACKING”
- FIRST FIREWALL
roughly 30 years ago
21. SO…BY THE EARLY 1980s WE HAD THE:
- FIRST HARMLESS WORM
- FIRST SPAM
- FIRST “HACKING”
- FIRST FIREWALL
NO
“CYBER DATA
SCIENCE”
roughly 30 years ago
22. But hold on, things are about to move pretty fast and we are going to start going all in on developing data-driven solutions to some of these problems (it's not all good
news, though).
23. FEATURE GENERATION / ALGORITHMIC THINKING
• SIGNATURE ANTI-VIRUS
The “Brain” virus inspired signature A/V (which is a really naive feature generation and comparison algorithm, so it “counts” as data science).
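To make the "naive feature generation and comparison" point concrete, here is a minimal Python sketch of signature matching. The signature names and byte patterns are made up for illustration (they are not from the talk and match no real malware), and a plain substring search is assumed as the comparison step.

```python
# Hypothetical signature database: name -> byte pattern treated as the "feature".
# These patterns are invented for illustration and match no real malware.
SIGNATURES = {
    "demo-boot-virus": bytes.fromhex("deadbeef0102"),
    "demo-macro-virus": b"EVIL_MACRO_MARKER",
}


def scan_bytes(data, signatures=SIGNATURES):
    """Return the sorted names of every signature whose pattern occurs in data."""
    return sorted(name for name, pattern in signatures.items() if pattern in data)


clean = b"just an ordinary program image"
infected = b"header" + bytes.fromhex("deadbeef0102") + b"payload"

print(scan_bytes(clean))     # []
print(scan_bytes(infected))  # ['demo-boot-virus']
```

The weakness is visible right in the sketch: change one byte of the pattern and the match disappears.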
24. FEATURE GENERATION / ALGORITHMIC THINKING
• SIGNATURE ANTI-VIRUS
• SIGNATURE INTRUSION DETECTION SYSTEMS (IDS)
Signature based intrusion detection also started being designed at this time.
25. FEATURE GENERATION / ALGORITHMIC THINKING
• SIGNATURE ANTI-VIRUS
• SIGNATURE INTRUSION DETECTION SYSTEMS (IDS)
• HEURISTIC ANTI-VIRUS
Back in the day, program binary structure was highly uniform, and simple heuristics based on generated features were becoming differentiators even in shareware land.
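A rough sketch of what "simple heuristics based on generated features" could look like: derive a couple of features from a file's bytes and add up a score. The features (byte entropy and a token list), the tokens themselves and the 7.0 threshold are all assumptions chosen for illustration, not what any particular product used.

```python
import math
from collections import Counter

# Hypothetical tokens a heuristic might treat as suspicious (illustrative only).
SUSPICIOUS_TOKENS = (b"format c:", b"int 13h")


def shannon_entropy(data):
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def heuristic_score(data):
    """Add one point for high entropy and one per suspicious token found."""
    score = 0
    if shannon_entropy(data) > 7.0:  # assumed threshold for "looks packed/encrypted"
        score += 1
    lowered = data.lower()
    score += sum(1 for token in SUSPICIOUS_TOKENS if token in lowered)
    return score


print(heuristic_score(b"hello world"))                      # 0
print(heuristic_score(bytes(range(256)) * 4 + b"FORMAT C:"))  # 2
```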
27. FEATURE GENERATION / ALGORITHMIC THINKING
• SIGNATURE ANTI-VIRUS
• SIGNATURE INTRUSION DETECTION SYSTEMS (IDS)
• HEURISTIC ANTI-VIRUS
ANOMALY DETECTION / PATTERN RECOGNITION
• IDS
The Morris worm wasn't the only causal factor for enhancements in data-driven network defense solutions, but it was a good one. Time series anomaly detection and
decision-tree-based online pattern recognition became part of the defender’s toolbox.
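As one hedged illustration of time series anomaly detection, the sketch below flags points that sit far from the mean of a trailing window (a rolling z-score). The window size, the threshold and the connection counts are hypothetical, and this is only one of the simplest ways to do it.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(series, window=5, threshold=3.0):
    """Return indices whose value is more than `threshold` standard deviations
    away from the mean of the preceding `window` values."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as anomalous.
            if series[i] != mu:
                anomalies.append(i)
        elif abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical connections-per-minute counts with one burst at index 8.
counts = [20, 22, 19, 21, 20, 23, 21, 20, 95, 22, 21]
print(rolling_zscore_anomalies(counts))  # [8]
```

Note that the burst inflates the window's spread for the next few points, so a second, smaller burst right after it would likely be missed.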
28. FEATURE GENERATION
• SIGNATURE ANTI-VIRUS
• SIGNATURE INTRUSION DETECTION SYSTEMS (IDS)
• HEURISTIC ANTI-VIRUS
ANOMALY DETECTION / PATTERN RECOGNITION
• IDS
• SYSTEM MISUSE DETECTION (INITIALLY MAINFRAME)
DATA ENGINEERING + NASCENT CORRELATION
Cue the 90s and the early aughts, and we have log management and correlation internally and projects like DShield (SANS to you folks today) leveling up the data
engineering side of things.
29. And, we thought, “all we need is more data!”. So we complained about not having enough logs from enough systems going to the right places so we could detect and
stop all these nefarious ne’er-do-wells.
30. But we were still defending with a castle-and-moat mindset. We thought we had visibility into data ingress & egress.
31. So. Many. Blinky. Lights.
NIDS
HIDS
IDS
IPS
Active Directory
LDAP
Weblogs
Database logs
A/V Logs
And we were cramming it all into our racks of high density storage tied to systems with blinky lights and we were so confident that we could finally protect the castle with
our rudimentary tools. And, despite some missteps along the way, we were doing OK.
33. The peasants, I mean, workers, were freed from the confines of the castle.
34. Today’s workers are likely carrying multiple devices, each orders of magnitude more powerful than all the computers on the initial ARPANET put together, transferring
data at speeds nigh unimaginable. Each device is a gateway to data and even the internal network itself. We may have secured the gates and some internal buildings, but
these folks are super vulnerable. Just like the ARPANET founders, we failed to threat model properly and fast enough to prepare for this.
35. REACTIVE PERIOD
LACK OF RESOURCES FOR
ACTIVE DEFENSE
LACK OF APPROPRIATE
THREAT MODELING
FAILURE TO ANTICIPATE
CHANGE
LETTING THE ATTACKER LEAD
36. WHERE WE ARE
THE “GOLDEN AGE OF SECURITY DATA SCIENCE”
Which brings us to today. I’d argue that we’re in the golden age (I could be persuaded to call it the silver age) of cybersecurity data science.
43. And we haven’t even managed to defend against clever folks writing worms that display snarky messages.
44. “People don’t
[break into banks]
because they’re
not secure. They
do it because
that’s where the
money is.”
—Janet Abbate
Janet Abbate is an historian who has written fairly extensively about the founding and evolution of the internet. I bracketed out the “break into banks” because it's not really
about banks. It’s about understanding the goals of our intelligent adversaries and using data science to make it harder for them to achieve those goals.
45. REACTIVE PERIOD
LACK OF RESOURCES FOR
ACTIVE DEFENSE
LACK OF APPROPRIATE
THREAT MODELING
FAILURE TO ANTICIPATE
CHANGE
LETTING THE ATTACKER LEAD
ACTIVE PERIOD
46. YOU CAN’T SUCCEED IN
CYBERSECURITY DATA SCIENCE
WITHOUT AN EFFECTIVE
RISK MANAGEMENT PROGRAM
47. Humans
Mainframes
macOS/Windows
IPv6
Cloud Computing
Third-parties
Biz Dev
App Dev
M&A
Data Centers
Email
Social Media
Threat Intel
Anti-malware
Firewalls
Routers
DNS
Servers
IDS/IPS
Linux
Vulnerabilities
Hacking
DoS
iOS/Android
Ransomworms
Espionage
Credentials
You can’t do everything, and you can’t do all the things you think you want to do all at once.
48. WHAT YOU HAVE
There are three core elements when designing present-day active data-driven security strategies
51. REACTIVE PERIOD
LACK OF RESOURCES FOR
ACTIVE DEFENSE
LACK OF APPROPRIATE
THREAT MODELING
FAILURE TO ANTICIPATE
CHANGE
LETTING THE ATTACKER LEAD
ACTIVE PERIOD
THREAT + RISK MODELING
DATA ENGINEERING
TARGETED DEPLOYMENT OF
(TASK-SPECIFIC)
ALGORITHMS
EXCEPTIONAL
COMMUNICATION
52. WHAT THE FUTURE HOLDS
“HERE BE DRAGONS”
Which brings us to today.
53. It gets worse for us. Everything is being connected and virtually nobody cares about security or privacy. Many pay lip service to it. Governments feign concern. We’re
repeating many of the same mistakes the ARPANET folks made, just at scale. This is a target-rich environment for cybersecurity data science.
54. And, we’re not just connecting ‘things’ but we’re also connecting ourselves. You’re not running TensorFlow on a pacemaker or an insulin pump any time soon, but
cybersecurity data science is going to be critical to ensure the safety of these devices (and the folks they’re keeping alive).
55. Our adversaries aren’t just in the digital world, either. There’s a fledgling but rapidly growing field of adversarial machine learning, where there are deliberate efforts to do
things like make autonomous vehicles fail by vandalizing street signs or introducing other items into the visual, audio and environmental sensors to throw them off.
This is a great space to bring in the Metasploit crafters and have them bring their art to bear in the realm of machine learning. I should also note that you have other types
of adversarial issues today. Attackers know where you get your feeds from and they also know when you’re probing their systems on the internet. The good ones are
messing just enough with your ground truth to make your data-driven defense efforts less effective. The future version is going to cost lives as well as $.
56. Unless something occurs to cause a radical change in direction, we’re headed towards fully online systems for voting at almost every government level. When you have
the added conundrum of anonymity at both the user (voter) level and attacker level, how do you develop sufficient statistical methods to detect attacks or determine when
democracy itself is being hacked?
57. We’re sticking cameras and microphones everywhere. You’ve all likely got one (or more) on you now that may be configured to be listening for its wake-up command. It’s
hard enough implementing effective data-driven monitoring and defense models in a resourced organization. How do we enable data-driven protections at home? What’s
more, when these devices make their way into your organizations (they likely are there already), how will you threat model them? Heck, you may even need to come up
with data-driven ways to detect them first.
58. And we’re letting ourselves be tracked all the time in the most intimate ways. This is extraordinarily valuable data. How will you develop data-driven detection and
defense methods for it? And, make no mistake, you will have to, even if you don’t work for one of the companies in the photo. This data is now and will absolutely be for
sale in the future. You’re going to be responsible for defending it at some point.
59. COGNIZANT PERIOD
IT’S NOT JUST ABOUT “SECURITY”
I call this the cognizant period because it’s both about awareness and ownership.
We’re quickly moving beyond bitcoins and credentials. It’s about safety and having a handle on all the moving parts.
60. Pretty much like the computer on the Enterprise. Now, this is both a good and bad example, since the Enterprise had far more security incidents than it should have had,
but the system could communicate ship status effectively and notify about potentially harmful anomalous readings.
What if one of the features of your machine learning model for industrial control systems was knowledge of the power consumption of the ICS sensors or ICS
modules themselves, used (with other features) to train a classifier to predict whether something was impacted by malware? (If cameras had this feature
today, Mirai would have been limited to a handful of systems vs upwards of 4m.)
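A minimal sketch of that idea, under stated assumptions: the talk only says "train a classifier", so a nearest-centroid classifier stands in here because it fits in a few lines of standard-library Python. The two features (power draw and outbound connections per minute) and every training value are hypothetical.

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical training data: (power draw in watts, outbound connections per minute).
TRAINING = {
    "clean": [(4.0, 2.0), (4.2, 3.0), (3.9, 2.0), (4.1, 1.0)],
    "infected": [(6.5, 40.0), (6.8, 55.0), (7.0, 48.0)],
}


def fit(training):
    """Standardize each feature, then compute one centroid per label."""
    rows = [row for samples in training.values() for row in samples]
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    scales = [pstdev(col) or 1.0 for col in columns]  # avoid dividing by zero

    def scale(row):
        return tuple((v - m) / s for v, m, s in zip(row, means, scales))

    centroids = {
        label: tuple(mean(col) for col in zip(*(scale(r) for r in samples)))
        for label, samples in training.items()
    }
    return scale, centroids


def predict(model, row):
    """Return the label whose centroid is closest to the standardized row."""
    scale, centroids = model
    point = scale(row)
    return min(centroids, key=lambda label: dist(point, centroids[label]))


model = fit(TRAINING)
print(predict(model, (4.3, 4.0)))   # clean
print(predict(model, (6.9, 50.0)))  # infected
```

Standardizing first keeps the connection count from drowning out the power reading, which is the feature the example is really about.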
If your machine learning systems were trained to understand human patterns better, they would likely know that your CFO never sends mail at 3:30 AM from address
ranges in strange autonomous systems while connecting through Gmail. If we can use convolutional neural networks in high-precision image recognition, there’s no reason
to believe we can’t do something similar in this space with the right kind of thinking.
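The CFO example can be sketched without any neural network at all. The code below is a deliberately simple stand-in (a per-sender baseline of previously seen send hours and autonomous systems), not the convolutional approach mused about above. The class name, the sender address and the observations are hypothetical, and the AS numbers come from the range reserved for documentation.

```python
from collections import defaultdict


class SenderBaseline:
    """Remember which send hours and origin ASNs have been seen per sender."""

    def __init__(self):
        self.hours = defaultdict(set)
        self.asns = defaultdict(set)

    def observe(self, sender, hour, asn):
        """Record one legitimate message: hour is 0-23, asn is an AS number."""
        self.hours[sender].add(hour)
        self.asns[sender].add(asn)

    def unusual_traits(self, sender, hour, asn):
        """List which attributes of a new message were never seen for this sender."""
        traits = []
        if hour not in self.hours.get(sender, set()):
            traits.append("hour")
        if asn not in self.asns.get(sender, set()):
            traits.append("asn")
        return traits


baseline = SenderBaseline()
for hour in range(9, 18):  # hypothetical history: business hours, one home ASN
    baseline.observe("cfo@example.com", hour, 64496)

print(baseline.unusual_traits("cfo@example.com", 10, 64496))  # []
print(baseline.unusual_traits("cfo@example.com", 3, 64511))   # ['hour', 'asn']
```

A message that trips both checks at once is the "3:30 AM from a strange autonomous system" case; a real system would need to weigh such signals rather than treat each as a hard rule.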
61. COGNIZANT PERIOD
IT’S NOT JUST ABOUT “SECURITY”
ACTIVE, ADAPTIVE, APPLIED
“DATA SCIENCE”
active recommendation systems
automated rule creation and learning from adapted rules
identifying improvement areas based on previous faults (go through app dev example)
62. active recommendation systems
automated rule creation and learning from adapted rules
identifying improvement areas based on previous faults (go through app dev example)
learning from human input
63. COGNIZANT PERIOD
IT’S NOT JUST ABOUT “SECURITY”
ACTIVE, ADAPTIVE, APPLIED
“DATA SCIENCE”
CONTINUOUS AWARENESS &
SHARING
active recommendation systems
automated rule creation and learning from adapted rules
identifying improvement areas based on previous faults (go through app dev example)
64.
65. IT IS PARAMOUNT THAT WE DO NOT
REPEAT THE MISTAKES OF THE PAST AND
WORK TO ENSURE THAT “DATA SCIENCE”
BECOMES AN ACTIVE, INTEGRAL PART OF
DEVELOPMENT, DETERRENCE & DEFENSE