We present ways to secure cyberspace with feature-rich, robust, yet lean machine learning algorithms that help organizations identify malicious actors, intruders, and illegal system access by studying features derived purely from system login behavior.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
2. A Few Problems in Cybersecurity
1. Malicious external/internal threats (phishing, malicious domains,
etc.)
2. Large-scale attacks (DDoS, spam campaigns, etc.)
3. Data loss (data exfiltration)
4. User behavioural analytics (insider threat, account takeover)
These are the primary problems enterprises want to solve,
as they directly affect the business.
3. How are these cybersecurity problems handled?
1. Rule-based systems
2. Large-scale use of experts who understand the systems well
3. Expert identification of conditions, and combinations of conditions,
that are true markers of malicious behaviour
4. Multiple security professionals who understand specific conditions
and combinations, and can identify malicious behaviour
4. Is this justified?
YES.
Why?
1. Cybersecurity's focus is to identify every instance of malicious
behaviour and not leave things to probability.
2. The risk associated with each security event is large, which makes
identifying each event very important.
5. What is the problem with this approach?
1. It takes time, as a large volume of logs needs to be analysed and
each threat must be classified as real, potential, or false positive.
2. It requires experts and a large number of professionals.
3. It is a manual process that requires investigating associated
events across multiple logs, which is considerably slow.
4. Even with a thorough investigation, a malicious event can still be
missed because it is anomalous.
6. Outlier? Anomalous?
1. Outliers are, simply put, events that have a low probability of
occurrence when statistically modeled.
2. Anomalies are events that have never been seen before.
3. Identifying anomalous events is difficult.
7. How do you solve this problem?
1. Create a malicious behaviour context based on your domain
knowledge.
2. Use the context to statistically transform the anomalous
behaviour into an outlier, or at least into a unique occurrence.
3. See if the model fits your contextual assumptions.
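One possible way to picture step 2 is to score each event by how rare its context is in the history. The sketch below is an illustration only, not the method used in this work: the `(user_type, login_type)` context, the `surprise_scores` name, and the smoothing constant `alpha` are all assumptions. With additive smoothing, a combination that was never seen still gets a small, non-zero probability, so it shows up as the most extreme outlier rather than falling outside the model.

```python
import math
from collections import Counter


def surprise_scores(history, new_events, alpha=1.0):
    """Score events by -log(probability) of their context in the history.

    history and new_events are lists of hashable contexts, here hypothetical
    (user_type, login_type) tuples. alpha is an additive smoothing constant.
    """
    counts = Counter(history)
    total = len(history)
    # One extra bucket stands for "any context never seen before".
    k = len(counts) + 1
    scores = []
    for event in new_events:
        p = (counts[event] + alpha) / (total + alpha * k)
        scores.append(-math.log(p))
    return scores


# Illustrative, made-up history of login contexts.
history = (
    [("general", "interactive")] * 90
    + [("admin", "remote")] * 8
    + [("general", "remote")] * 2
)
new_events = [
    ("general", "interactive"),  # common
    ("general", "remote"),       # rare, but seen
    ("system", "remote"),        # never seen
]
scores = surprise_scores(history, new_events)
```

The never-seen context receives the highest score, the rare one the next highest, and the common one the lowest, which is the ordering step 2 asks for.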
8. Example
1. Studying successful Windows user login times for the entire
enterprise does not yield interesting behaviour.
2. Studying these user logins in context is important.
3. The login patterns of general users, administrators, and system
accounts are different.
4. Different kinds of logins (physical system logins, network-based,
remote, unlocks, cached logins) also differ in behaviour.
5. Interactions between types of users and types of logins also yield
unique behaviour. Each analytical context is associated with a
certain expected behaviour. Any violation of this expected
behaviour is flagged and studied.
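The example above can be sketched as a per-context baseline. This is a minimal illustration under stated assumptions: each login is reduced to a hypothetical `(user_type, login_type, hour)` tuple, the expected behaviour of a context is summarised by the mean and standard deviation of its login hour, and the z-score threshold of 3 is arbitrary. It also treats the hour as a plain number, ignoring that 23:00 and 00:00 are adjacent.

```python
from collections import defaultdict
from statistics import mean, stdev


def build_baselines(events):
    """Return {(user_type, login_type): (mean_hour, stdev_hour)}."""
    hours = defaultdict(list)
    for user_type, login_type, hour in events:
        hours[(user_type, login_type)].append(hour)
    # stdev needs at least two observations per context.
    return {ctx: (mean(h), stdev(h)) for ctx, h in hours.items() if len(h) >= 2}


def flag(event, baselines, z_threshold=3.0):
    """Return a reason string if the event violates its context, else None."""
    user_type, login_type, hour = event
    ctx = (user_type, login_type)
    if ctx not in baselines:
        return "unseen context"
    mu, sigma = baselines[ctx]
    if sigma == 0:
        return "unexpected time" if hour != mu else None
    if abs(hour - mu) / sigma > z_threshold:
        return "unexpected time"
    return None


# Illustrative, made-up history.
history = (
    [("general", "interactive", h) for h in (8, 9, 9, 10, 9)]
    + [("admin", "remote", h) for h in (22, 23, 22, 21, 23)]
)
baselines = build_baselines(history)
```

With these baselines, a 22:00 remote login is unremarkable for an administrator, while a 22:00 interactive login by a general user is flagged. The same hour means different things in different contexts, which is why enterprise-wide login times alone are uninformative.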
10. The Problem? Even Now?
1. The biggest problem even now is that there is no ground truth to
confirm that a behaviour flagged as unexpected, outside its
context, is truly anomalous.
2. Therefore we end up with an unsupervised problem.
3. Anomalous behaviour detection in cybersecurity is unsupervised.
Only data tells us the truth. We validate our analysis using feedback.
11. How do we solve this?
1. We still have experts who can determine whether the identified
behaviours are indeed malicious.
2. The information we provide speeds up the analytics and
investigation.
3. Building the context and statistically identifying unexpected
behaviour reduces the need to go through unnecessary data.
4. We use this feedback at multiple levels:
a. improve the features that go into the context
b. modify the context itself
c. look at changes in thresholds
d. use the feedback as a mechanism to turn the problem into a
supervised problem.
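Items c and d of the feedback loop could look something like the sketch below. It is an assumption-laden illustration, not the procedure used here: analyst verdicts are modelled as booleans (True for confirmed malicious), and the names `adjust_threshold`, `to_training_set`, `target_precision`, and `step` are hypothetical.

```python
def adjust_threshold(threshold, verdicts, target_precision=0.5, step=0.25):
    """Nudge the flagging threshold based on analyst verdicts.

    verdicts is a list of booleans, True when an analyst confirmed that a
    flagged event was malicious.
    """
    if not verdicts:
        return threshold
    precision = sum(verdicts) / len(verdicts)
    if precision < target_precision:
        # Too many false positives: flag less aggressively.
        return threshold + step
    # Most alerts were real: flag more aggressively, within a lower bound.
    return max(step, threshold - step)


def to_training_set(flagged):
    """Turn reviewed events into (X, y) for a supervised model.

    flagged is a list of (features, verdict) pairs; verdict is None for
    events that no analyst has reviewed yet, and those are skipped.
    """
    X = [features for features, verdict in flagged if verdict is not None]
    y = [int(verdict) for _, verdict in flagged if verdict is not None]
    return X, y
```

Once enough reviewed events accumulate, `X` and `y` can be handed to any supervised classifier, while the unsupervised flagging keeps running to surface new candidates for review.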
13. Event Correlation and Behavioural Identification - A
perfect segue to log correlation.
14. 1. The idea of context is used wherever identifying malicious
behaviour is important.
2. Individual logs (system logs, network logs) are not comprehensive
enough to identify anomalous events on their own.
3. Therefore it is important to use log correlation to identify events
and to build a context around each event.
4. Individual events can never be considered in a vacuum.
5. The logs are correlated primarily by time and then by possibly
connected events.
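Correlating first by time and then by connection can be sketched in two steps: merge the separate logs into one timeline, then group events that share a linking key and occur close together. The sketch assumes each log is already sorted by time, that events are dictionaries with a `ts` field in seconds and a `host` field, and that a 300-second gap separates unrelated activity. All of those field names and values are hypothetical.

```python
import heapq


def merge_by_time(*streams):
    """Merge several time-sorted logs into one time-sorted list."""
    return list(heapq.merge(*streams, key=lambda event: event["ts"]))


def group_linked(events, key="host", gap=300):
    """Group time-sorted events that share `key` and are at most `gap` apart."""
    groups = []
    open_group = {}  # key value -> most recent group for that value
    for event in events:
        group = open_group.get(event[key])
        if group is not None and event["ts"] - group[-1]["ts"] <= gap:
            group.append(event)
        else:
            group = [event]
            groups.append(group)
            open_group[event[key]] = group
    return groups


# Illustrative, made-up log entries.
system_log = [
    {"ts": 0, "host": "pc1", "source": "system"},
    {"ts": 100, "host": "pc2", "source": "system"},
]
network_log = [
    {"ts": 30, "host": "pc1", "source": "network"},
    {"ts": 1000, "host": "pc1", "source": "network"},
]
timeline = merge_by_time(system_log, network_log)
groups = group_linked(timeline)
```

Each resulting group is a candidate context: a set of events from different logs that belong to the same host and the same stretch of time.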
15. Example of Event/Log Correlation - An example of an event
A user account has multiple failed logins, followed by a successful login.
The machine that was successfully logged into connects to a database server
and requests a database dump, and this data is downloaded back to the
machine.
Identifying these events, and identifying that they happen in
a series, is event correlation.
16. Let's break these events down. You have:
1. Multiple login attempts and one final successful login (could be interpreted
as a user mistyping their password; we all do that)
2. A connection to a database server (totally harmless)
3. A dump of the data on the machine (someone might be creating a new
database and took a dump)
4. A move of the data dump to the local machine (totally fine if someone
wants to work on the data locally)
17. The Analysis of correlated events
1. Here we have 4 different events that tell us a story only when they are
correlated.
2. Correlation is important because behavioural anomalies described earlier
are not statistical outliers. They are unseen data points.
3. These anomalies surface after observing the interactions between
different events.
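The four-event story can be expressed as an ordered pattern that only fires when all steps occur for the same user within a time window. The following is a rough sketch under assumptions: the event type names in `CHAIN`, the `login_failure` type, the `user`/`type`/`ts` fields, the minimum of three failures, and the one-hour window are all made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical event types, in the order they must occur after the failures.
CHAIN = ["login_success", "db_connect", "db_dump", "data_transfer"]


def find_suspicious_chains(events, min_failures=3, window=timedelta(hours=1)):
    """Return (user, start_ts, end_ts) for each completed chain."""
    hits = []
    state = {}  # user -> progress through the pattern
    for e in sorted(events, key=lambda ev: ev["ts"]):
        s = state.setdefault(e["user"], {"failures": 0, "stage": 0, "start": None})
        # Forget progress that is too old to be part of the same story.
        if s["start"] is not None and e["ts"] - s["start"] > window:
            s.update(failures=0, stage=0, start=None)
        if e["type"] == "login_failure" and s["stage"] == 0:
            if s["start"] is None:
                s["start"] = e["ts"]
            s["failures"] += 1
        elif e["type"] == CHAIN[s["stage"]]:
            if s["stage"] == 0 and s["failures"] < min_failures:
                # An ordinary login after few or no failures: start over.
                s.update(failures=0, start=None)
                continue
            s["stage"] += 1
            if s["stage"] == len(CHAIN):
                hits.append((e["user"], s["start"], e["ts"]))
                s.update(failures=0, stage=0, start=None)
    return hits


def make_event(user, event_type, ts):
    """Build one event record in the hypothetical format used above."""
    return {"user": user, "type": event_type, "ts": ts}
```

Each step on its own stays unflagged; only the complete ordered sequence for one user produces a hit, which mirrors the point that these anomalies surface from interactions between events rather than from any single one.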
18. What have we gathered?
1. Defining the right context to identify anomalous
malicious events.
2. Identifying correlated events from logs.
3. Transforming anomalous behaviour into statistical outliers.
4. Verifying with experts.
19. Thanks to the attendees, support staff, open source
members of H2O, colleagues, and our clients for helping
us help them by analysing new datasets and growing H2O.
20. The Team
Mark Chan - Scientist, Engineer,
Hacker, Ninja.
Ivy Wang - UI, Problem, Details,
Details, and Details Expert.
Fonda Ingram - Comms, and
Reqs Expert, The Wall (GoT).