Self-Learning Systems for Cyber Security
Ledningsregementet Enköping
Kim Hammar & Rolf Stadler
kimham@kth.se & stadler@kth.se
Division of Network and Systems Engineering
KTH Royal Institute of Technology
August 18, 2021
Goal: Automation and Learning
- Challenges:
  - Evolving & automated attacks
  - Complex infrastructures
- Our Goal:
  - Automate security tasks
  - Adapt to changing attack methods
[Figure: IT infrastructure with an attacker, external clients, a defender, and a gateway with an IDS that raises alerts, in front of 31 networked application servers.]
Approach: Game Model & Reinforcement Learning
- Challenges:
  - Evolving & automated attacks
  - Complex infrastructures
- Our Goal:
  - Automate security tasks
  - Adapt to changing attack methods
- Our Approach:
  - Model network attack and defense as games.
  - Use reinforcement learning to learn policies.
  - Incorporate learned policies in self-learning systems.
State of the Art
- Game-Learning Programs:
  - TD-Gammon, AlphaGo Zero [1], OpenAI Five, etc.
  - ⇒ Impressive empirical results of RL and self-play
- Attack Simulations:
  - Automated threat modeling [2], intrusion detection, etc.
  - ⇒ Need for automation and better security tooling
- Mathematical Modeling:
  - Game theory [3]
  - Markov decision theory, dynamic programming [4]
  - ⇒ Many security operations involve strategic decision making

[1] David Silver et al. "Mastering the game of Go without human knowledge". In: Nature 550 (Oct. 2017), pp. 354–359. URL: http://dx.doi.org/10.1038/nature24270.
[2] Pontus Johnson, Robert Lagerström, and Mathias Ekstedt. "A Meta Language for Threat Modeling and Attack Simulations". In: Proceedings of the 13th International Conference on Availability, Reliability and Security (ARES 2018). Hamburg, Germany: ACM, 2018. DOI: 10.1145/3230833.3232799.
[3] Tansu Alpcan and Tamer Basar. Network Security: A Decision and Game-Theoretic Approach. 1st ed. Cambridge University Press, 2010.
[4] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. 3rd ed. Vol. I. Belmont, MA: Athena Scientific, 2005.
Our Work
- Use Case: Intrusion Prevention
- Our Method:
  - Emulating computer infrastructures
  - System identification and model creation
  - Reinforcement learning and generalization
- Results:
  - Learning to Capture The Flag
  - Learning to Prevent Attacks (Optimal Stopping)
- Conclusions and Future Work
Use Case: Intrusion Prevention
- A Defender owns an infrastructure
  - Consists of connected components
  - Components run network services
  - Defends the infrastructure by monitoring and active defense
- An Attacker seeks to intrude on the infrastructure
  - Has a partial view of the infrastructure
  - Wants to compromise specific components
  - Attacks by reconnaissance, exploitation and pivoting
Our Method for Finding Effective Security Strategies
[Figure: method pipeline. A real-world infrastructure is selectively replicated in an emulation system; model creation & system identification yield a simulation system; reinforcement learning & generalization produce a policy π, which is evaluated against the emulation (policy evaluation & model estimation) and finally mapped back to the real infrastructure (policy implementation, automation & self-learning systems).]
Emulation System and Configuration Space Σ
[Figure: configuration space Σ with sample configurations σi mapped to emulated infrastructures on subnets 172.18.4.0/24, 172.18.19.0/24 and 172.18.61.0/24.]
Emulation: a cluster of machines that runs a virtualized infrastructure which replicates important functionality of target systems.
- The set of virtualized configurations defines a configuration space Σ = ⟨A, O, S, U, T, V⟩.
- A specific emulation is based on a configuration σi ∈ Σ (a minimal configuration sketch follows below).
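The configuration space is easiest to see with a small data structure. Below is a minimal, hypothetical sketch of how one configuration σi ∈ Σ could be represented; the class and field names (EmulationConfig, NodeConfig, etc.) are illustrative assumptions, not the actual emulation framework.

```python
# Hypothetical sketch of a single configuration sigma_i in the configuration
# space Sigma = <A, O, S, U, T, V>. Class and field names are illustrative
# assumptions, not the actual emulation framework.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class NodeConfig:
    ip: str                      # address, e.g. in 172.18.4.0/24
    os: str                      # operating-system image
    services: List[str]          # running network services
    vulnerabilities: List[str]   # e.g. CVE identifiers

@dataclass
class EmulationConfig:
    """One configuration sigma_i: topology, vulnerabilities, flags and traffic."""
    subnet: str
    nodes: List[NodeConfig] = field(default_factory=list)
    flags: Dict[str, str] = field(default_factory=dict)                   # node ip -> flag path
    traffic_profiles: Dict[str, List[str]] = field(default_factory=dict)  # client -> protocols

# A tiny example configuration with one vulnerable application server
sigma_i = EmulationConfig(
    subnet="172.18.4.0/24",
    nodes=[NodeConfig(ip="172.18.4.2", os="ubuntu-20",
                      services=["ssh", "http"],
                      vulnerabilities=["CVE-2014-6271"])],
    flags={"172.18.4.2": "/root/flag.txt"},
    traffic_profiles={"client_1": ["http", "ssh", "snmp", "icmp"]},
)
```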
Emulation: Execution Times of Replicated Operations
[Figure: distributions of action execution times (costs) in the emulation for infrastructures of size |N| = 25, 50, 75 and 100; execution times range up to roughly 2000 seconds per action.]
- Fundamental issue: computational methods for policy learning typically require samples on the order of 100k–10M.
- ⇒ Infeasible to optimize in the emulation system (e.g. 10^6 actions at ~100 s each corresponds to more than three years of emulation time).
From Emulation to Simulation: System Identification
[Figure: the emulated network with nodes m1–m7 is abstracted into per-node feature vectors [m_{i,1}, ..., m_{i,k}], which define a POMDP model ⟨S, A, P, R, γ, O, Z⟩ with states s_t, actions a_t and observations o_t.]
- Abstract Model Based on Domain Knowledge: models the set of controls, the objective function, and the features of the emulated network.
  - Defines the static parts of the POMDP model ⟨S, A, P, R, γ, O, Z⟩.
- Dynamics Model (P, Z) Identified using System Identification: an algorithm based on random walks and maximum-likelihood estimation (a sketch follows below):

  M̂(b′ | b, a) ≜ n(b, a, b′) / Σ_{j′} n(b, a, j′)

  where n(b, a, b′) counts the observed transitions from b to b′ under action a.
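To make the estimator concrete, here is a minimal sketch (not the authors' implementation) of collecting transition counts with random walks in the emulation and normalizing them into the maximum-likelihood dynamics estimate M̂(b′ | b, a); the environment interface (reset, step, actions) is an assumption.

```python
# Minimal sketch (not the authors' implementation) of estimating the dynamics
# by maximum likelihood: M_hat(b' | b, a) = n(b, a, b') / sum_{j'} n(b, a, j'),
# with the counts n(.) collected by random walks in the emulation.
import random
from collections import defaultdict

def random_walk_counts(env, num_steps=10_000):
    """Collect transition counts n[(b, a)][b'] using uniformly random actions.
    `env` is assumed to expose reset(), step(a) -> next state/belief, and a
    list of actions; this interface is an assumption for the sketch."""
    counts = defaultdict(lambda: defaultdict(int))
    b = env.reset()
    for _ in range(num_steps):
        a = random.choice(env.actions)
        b_next = env.step(a)
        counts[(b, a)][b_next] += 1
        b = b_next
    return counts

def estimate_dynamics(counts):
    """Normalize the counts into conditional distributions M_hat(b' | b, a)."""
    M_hat = {}
    for (b, a), successors in counts.items():
        total = sum(successors.values())
        M_hat[(b, a)] = {b_next: n / total for b_next, n in successors.items()}
    return M_hat
```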
System Identification: Estimated Dynamics Model
[Figure: estimated conditional distributions P[·|(b_i, a_i)] for node 172.18.4.2: new TCP/UDP connections, new failed login attempts, created user accounts, new logged-in users, new login events and created processes.]
System Identification: Estimated Dynamics Model (cont.)
[Figure: estimated IDS dynamics P[·|(b_i, a_i)]: number of IDS alerts, severe IDS alerts, warning IDS alerts and IDS alert priorities.]
Policy Optimization in the Simulation System using Reinforcement Learning
- Goal:
  - Approximate π* = arg max_π E[Σ_{t=1}^{T} γ^{t−1} r_{t+1}]
- Learning Algorithm:
  - Represent π by πθ
  - Define the objective J(θ) = E_{πθ}[Σ_{t=1}^{T} γ^{t−1} r(s_t, a_t)]
  - Maximize J(θ) by stochastic gradient ascent (a minimal sketch follows this slide):
    ∇θ J(θ) = E_{πθ}[∇θ log πθ(a|s) · A^{πθ}(s, a)], where ∇θ log πθ(a|s) is the actor term and A^{πθ}(s, a) the critic (advantage) term
- Domain-Specific Challenges:
  - Partial observability
  - Large state space
  - Large action space
  - Non-stationary environment due to the attacker
  - Generalization
- Finding Effective Security Strategies through Reinforcement Learning and Self-Play [5]
- Learning Intrusion Prevention Policies through Optimal Stopping [6]
[Figure: agent-environment interaction loop: the agent takes action a_t; the environment returns state s_{t+1} and reward r_{t+1}.]

[5] Kim Hammar and Rolf Stadler. "Finding Effective Security Strategies through Reinforcement Learning and Self-Play". In: International Conference on Network and Service Management (CNSM). Izmir, Turkey, Nov. 2020.
[6] Kim Hammar and Rolf Stadler. Learning Intrusion Prevention Policies through Optimal Stopping. 2021. arXiv: 2106.07160 [cs.AI].
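As a concrete illustration of the gradient estimator above, the following is a minimal numpy sketch of a policy-gradient (REINFORCE-style) update for a linear-softmax policy, using the discounted return minus a baseline as the advantage estimate. The policy parameterization and episode format are assumptions made for the sketch; an actor-critic method would replace the Monte-Carlo advantage with a learned critic.

```python
# Minimal numpy sketch of the gradient estimator on the slide:
# grad_theta J(theta) = E[ grad_theta log pi_theta(a|s) * A(s, a) ].
# The linear-softmax policy and the episode format are illustrative assumptions.
import numpy as np

def policy(theta, s):
    """pi_theta(.|s): linear-softmax policy over discrete actions."""
    logits = theta @ s
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One stochastic-gradient-ascent step on J(theta) from a single episode,
    given as [(s_1, a_1, r_2), ..., (s_T, a_T, r_{T+1})]."""
    # Discounted returns G_t = sum_{k >= t} gamma^(k-t) r_{k+1}
    returns, G = [], 0.0
    for (_, _, r) in reversed(episode):
        G = r + gamma * G
        returns.insert(0, G)
    baseline = np.mean(returns)           # simple baseline for the advantage estimate
    for (s, a, _), G in zip(episode, returns):
        probs = policy(theta, s)
        grad_log = -np.outer(probs, s)    # d log pi / d theta for all action rows
        grad_log[a] += s                  # plus the indicator term for the taken action
        theta = theta + alpha * (G - baseline) * grad_log
    return theta
```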
The Target Infrastructure
- Topology:
  - 30 application servers, 1 gateway/IDS (Snort), 3 clients, 1 attacker, 1 defender
- Services:
  - 31 SSH, 8 HTTP, 1 DNS, 1 Telnet, 2 FTP, 1 MongoDB, 2 SMTP, 2 Teamspeak 3, 22 SNMP, 12 IRC, 1 Elasticsearch, 12 NTP, 1 Samba, 19 PostgreSQL
- RCE Vulnerabilities:
  - 1 CVE-2010-0426, 1 CVE-2014-6271, 1 SQL injection, 1 CVE-2015-3306, 1 CVE-2016-10033, 1 CVE-2015-5602, 1 CVE-2015-1427, 1 CVE-2017-7494
  - 5 brute-force vulnerabilities
- Operating Systems:
  - 23 Ubuntu 20, 1 Debian 9.2, 1 Debian Wheezy, 6 Debian Jessie, 1 Kali
- Traffic:
  - Client 1: HTTP, SSH, SNMP, ICMP
  - Client 2: IRC, PostgreSQL, SNMP
  - Client 3: FTP, DNS, Telnet
[Figure: the target infrastructure topology, as shown earlier.]
The Attacker Model: Capture the Flag (CTF)
- The attacker has T time-steps to collect flags, with no prior knowledge
- It can connect to a gateway that exposes public-facing services of the infrastructure
- It has a pre-defined set (cardinality ∼200) of network/shell commands available; each command has a cost
- To collect flags, it has to interleave reconnaissance and exploits
- Objective: collect all flags at minimal cost
The Formal Attacker Model: A Partially Observed MDP
- Model the infrastructure as a graph G = ⟨N, E⟩
- There are k flags at nodes C ⊆ N
- Each node N_i ∈ N has a node state s_i of m attributes
- Network state: s = {s_A, s_i | i ∈ N} ∈ ℝ^{|N|·m + |N|}
- The attacker observes o_A ⊂ s (results of commands)
- Action space: A = {a_1^A, ..., a_k^A} (commands)
- For every (s, a) ∈ S × A there is a probability w_{i,j}^{A,(x)} of failure and a probability of detection φ(det(s_i) · n_{i,j}^{A,(x)})
- State transitions s → s′ are decided by a discrete dynamical system s′ = F(s, a) (a minimal sketch follows the figure)
[Figure: (a) emulated infrastructure and (b) its graph model with nodes N0–N7, node state vectors S_0–S_7 and routers R1, R2.]
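To illustrate the graph model, here is a minimal sketch of a network state as per-node attribute vectors together with a deterministic transition function s′ = F(s, a) for a successful command; the attribute names and command set are illustrative assumptions, and the failure and detection probabilities are omitted.

```python
# Illustrative sketch (all names are assumptions) of the graph model G = <N, E>
# with per-node attribute vectors s_i and a deterministic transition s' = F(s, a)
# for a successful command; failure and detection probabilities are omitted.
from typing import Dict, Set, Tuple

NodeState = Tuple[int, int, int]        # (discovered, compromised, flag_found)
NetworkState = Dict[int, NodeState]     # node id -> node state s_i

edges: Set[Tuple[int, int]] = {(0, 1), (1, 2), (1, 3)}   # E: reachability between nodes

def F(s: NetworkState, action: Tuple[str, int]) -> NetworkState:
    """Apply an attacker command (cmd, target node) to the network state."""
    cmd, node = action
    s_next = dict(s)
    discovered, compromised, flag = s_next.get(node, (0, 0, 0))
    if cmd == "recon":
        s_next[node] = (1, compromised, flag)
    elif cmd == "exploit" and discovered:
        s_next[node] = (discovered, 1, flag)
    elif cmd == "search_flag" and compromised:
        s_next[node] = (discovered, compromised, 1)
    return s_next

# Example: recon then exploit node 1, starting from an empty state
s0: NetworkState = {}
s1 = F(F(s0, ("recon", 1)), ("exploit", 1))
```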
Learning Security Policies through Optimal Stopping
[Figure: the target infrastructure (as above) and an episode timeline from t = 1 to t = T: the intrusion starts at some intrusion event during the episode; stops before the event are early stopping times, stops after it are stopping times that interrupt the intrusion.]
- Intrusion Prevention as an Optimal Stopping Problem:
  - The defender observes the infrastructure (IDS, log files, etc.).
  - An intrusion occurs at an unknown time.
  - The defender can “stop” the intrusion.
  - Stopping shuts down the service provided by the infrastructure.
  - ⇒ trade off two objectives: service and security
  - Based on the observations, when is it optimal to stop?
A Partially Observed MDP Model for the Defender
- States:
  - Intrusion state i_t ∈ {0, 1}, terminal state ∅
- Observations:
  - Severe/warning IDS alerts (Δx, Δy), login attempts Δz, with observation distribution f_{XYZ}(Δx, Δy, Δz | i_t, I_t, t)
- Actions:
  - “Stop” (S) and “Continue” (C)
- Rewards:
  - Reward for security and service; penalty for false alarms and intrusions
- Transition probabilities:
  - A Bernoulli process (Q_t)_{t=1}^{T} ∼ Ber(p) defines the intrusion start time I_t ∼ Ge(p)
- Objective and Horizon:
  - max E_{πθ}[Σ_{t=1}^{T_∅} r(s_t, a_t)], with horizon T_∅, the time at which the terminal state ∅ is reached
(A belief-update sketch for this model follows the figure below.)
[Figure: state diagram 0 → 1 → ∅ (continue/stop transitions; the intrusion starts and ends), the reward function R^{a_t}_{s_t} over time-steps with separate stop and continue rewards, and the CDF of the intrusion start time I_t ∼ Ge(p = 0.2).]
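The belief b(1) that drives the defender's decision can be tracked with a standard Bayes filter. Below is a minimal sketch for the two-state intrusion POMDP under the "continue" action; the toy observation model Z is an assumption, whereas in practice the fitted distributions f_XYZ from the system identification would be used.

```python
# Minimal sketch of the Bayes filter for the belief b(1) = P[s_t = 1 | history]
# in the two-state intrusion POMDP (under the "continue" action). The toy
# observation model Z below is an assumption made for this sketch.
def belief_update(b1, o, Z, p=0.2):
    """One filtering step: predict with the per-step intrusion-start
    probability p, then correct with the observation likelihoods Z[s][o]."""
    pred1 = b1 + (1.0 - b1) * p                   # P[intrusion ongoing at t+1]
    num = Z[1].get(o, 0.0) * pred1
    den = num + Z[0].get(o, 0.0) * (1.0 - pred1)
    return num / den if den > 0 else pred1

# Toy observation model over coarse IDS-alert levels
Z = {0: {"few_alerts": 0.9, "many_alerts": 0.1},
     1: {"few_alerts": 0.3, "many_alerts": 0.7}}

b1 = 0.0
for o in ["few_alerts", "few_alerts", "many_alerts", "many_alerts"]:
    b1 = belief_update(b1, o, Z)
    print(f"b(1) = {b1:.3f}")
```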
Threshold Property of the Optimal Defender Policy (1/4)

Theorem. The optimal policy π* is a threshold policy of the form:

  π*(b(1)) = S (stop)     if b(1) ≥ α*
             C (continue) otherwise

where α* is a unique threshold and b(1) = P[s_t = 1 | a_1, o_1, ..., a_{t−1}, o_t].

- To see this, consider the optimality condition (Bellman equation):

  π*(b(1)) = arg max_{a∈A} [ r(b(1), a) + Σ_{o∈O} P[o | b(1), a] V*(b_o^a(1)) ]
Threshold Property of the Optimal Defender Policy (2/4)
- We use A = {S, C} and derive:

  π*(b(1)) = arg max { r(b(1), S),  r(b(1), C) + Σ_{o∈O} P[o | b(1), C] V*(b_o^C(1)) }

- The first term, ω = r(b(1), S), is the expected reward for stopping; the second term is the expected cumulative reward for continuing.
- Expanding the expressions and rearranging terms, we derive that it is optimal to stop iff:

  b(1) ≥ [ 110 + Σ_{o∈O} V*(b_o^C(1)) (p Z(o, 1, C) + (1 − p) Z(o, 0, C)) ]
         / [ 300 + Σ_{o∈O} V*(b_o^C(1)) (p Z(o, 1, C) + (1 − p) Z(o, 0, C) − Z(o, 1, C)) ]  ≜  α_{b(1)}
Threshold Property of the Optimal Defender Policy (3/4)
- Thus π* is determined by the scalar thresholds α_{b(1)}:
  - it is optimal to stop if b(1) ≥ α_{b(1)}
- The stopping set is: S = { b(1) ∈ [0, 1] : b(1) ≥ α_{b(1)} }
- Since V*(b) is piecewise linear and convex [7], S is also convex [8] and has the form [α*, β*], where 0 ≤ α* ≤ β* ≤ 1.
- When b(1) = 1 it is optimal to take the stop action S:

  π*(1) = arg max { 100,  −90 + Σ_{o∈O} Z(o, 1, C) V*(b_o^C(1)) } = S

- This means that β* = 1.

[7] Edward J. Sondik. "The Optimal Control of Partially Observable Markov Processes Over the Infinite Horizon: Discounted Costs". In: Operations Research 26.2 (1978), pp. 282–304. URL: http://www.jstor.org/stable/169635.
[8] Vikram Krishnamurthy. Partially Observed Markov Decision Processes: From Filtering to Controlled Sensing. Cambridge University Press, 2016. DOI: 10.1017/CBO9781316471104.
Threshold Property of the Optimal Defender Policy (4/4)
- As the stopping set is S = [α*, 1] and b(1) ∈ [0, 1],
- it is optimal to stop iff b(1) ≥ α*.
- Hence, Theorem 1 follows:

  π*(b(1)) = S (stop)     if b(1) ≥ α*
             C (continue) otherwise

(A minimal code sketch of this rule follows.)
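Combined with the belief update sketched earlier, Theorem 1 reduces the defender policy to a one-line rule. The threshold value below is purely illustrative; in the model α* is obtained from the optimality condition rather than chosen by hand.

```python
# The threshold policy of Theorem 1, driven by the belief b(1) maintained by
# the Bayes filter sketched earlier. alpha_star = 0.5 is only illustrative.
def threshold_policy(b1: float, alpha_star: float = 0.5) -> str:
    """Return 'S' (stop) if the belief reaches the threshold, else 'C' (continue)."""
    return "S" if b1 >= alpha_star else "C"
```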
Static Attackers to Emulate Intrusions

Time-steps t          Actions
1 – I_t ∼ Ge(0.2)     (Intrusion has not started)
I_t+1 – I_t+7         Recon; brute-force attacks (SSH, Telnet, FTP) on N2, N4, N10; login(N2, N4, N10); backdoor(N2, N4, N10); Recon
I_t+8 – I_t+11        CVE-2014-6271 on N17; SSH brute-force attack on N12; login(N17, N12); backdoor(N17, N12)
I_t+12 – I_t+16       CVE-2010-0426 exploit on N12; Recon; SQL injection on N18; login(N18); backdoor(N18)
I_t+17 – I_t+22       Recon; CVE-2015-1427 on N25; login(N25); Recon; CVE-2017-7494 exploit on N27; login(N27)

Table 1: Attacker actions to emulate an intrusion. (A sketch of how such a scripted attacker can be encoded follows.)
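One way to realize such a static attacker is as a fixed action sequence indexed relative to the sampled intrusion start time. The sketch below is a hypothetical encoding of Table 1; the action names merely mirror the table entries and are not a real attacker API.

```python
# Hypothetical encoding of the static attacker of Table 1 as a fixed action
# sequence executed after the sampled intrusion start time I_t ~ Ge(0.2).
# The action names mirror the table; they are not a real attacker API.
import numpy as np

INTRUSION_SEQUENCE = [
    # t = I_t+1 .. I_t+7
    "recon", "brute_force_ssh_N2", "brute_force_telnet_N4", "brute_force_ftp_N10",
    "login_N2_N4_N10", "backdoor_N2_N4_N10", "recon",
    # t = I_t+8 .. I_t+11
    "cve_2014_6271_N17", "ssh_brute_force_N12", "login_N17_N12", "backdoor_N17_N12",
    # t = I_t+12 .. I_t+16
    "cve_2010_0426_N12", "recon", "sql_injection_N18", "login_N18", "backdoor_N18",
    # t = I_t+17 .. I_t+22
    "recon", "cve_2015_1427_N25", "login_N25", "recon",
    "cve_2017_7494_N27", "login_N27",
]

def static_attacker_action(t: int, intrusion_start: int):
    """Return the attacker action at time-step t, or None before/after the intrusion."""
    step = t - intrusion_start - 1
    if 0 <= step < len(INTRUSION_SEQUENCE):
        return INTRUSION_SEQUENCE[step]
    return None

intrusion_start = int(np.random.geometric(p=0.2))   # sample I_t ~ Ge(0.2)
```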
Threshold Properties of the Learned Policies
[Figure: learned stopping probabilities πθ(stop | x + y) as a function of the total number of alerts, πθ(stop | z) as a function of login attempts, and πθ(stop | x, y) over severe alerts x and warning alerts y, evaluated against the StealthyAttacker and the NoisyAttacker; the surfaces exhibit soft thresholds and regions where the policy never stops.]
Open Challenge: Self-Play between Attacker and Defender
[Figure: learning curves over policy updates for the attacker πθA and the defender πθD in simulation and emulation: % flags captured per game, P[detected], reward per game, game length (steps), P[early stopping], and # IDS alerts per game.]
Learning curves of training the attacker and the defender simultaneously in self-play.
Conclusions & Future Work
- Conclusions:
  - We develop a method to find effective strategies for intrusion prevention:
    (1) emulation system; (2) system identification; (3) simulation system; (4) reinforcement learning; and (5) domain randomization and generalization.
  - We show that self-learning can be successfully applied to network infrastructures.
    - Self-play reinforcement learning in a Markov security game
  - Key challenges: stable convergence, sample efficiency, complexity of emulations, large state and action spaces, theoretical understanding of optimal policies
- Our research plans:
  - Extending the theoretical model
  - Relaxing simplifying assumptions (e.g. multiple defender actions)
  - Evaluation on real-world infrastructures
References
- Finding Effective Security Strategies through Reinforcement Learning and Self-Play [5]
  - Preprint (open access): https://arxiv.org/abs/2009.08120
- Learning Intrusion Prevention Policies through Optimal Stopping [6]
  - Preprint (open access): https://arxiv.org/pdf/2106.07160.pdf

[5] Kim Hammar and Rolf Stadler. "Finding Effective Security Strategies through Reinforcement Learning and Self-Play". In: International Conference on Network and Service Management (CNSM). Izmir, Turkey, Nov. 2020.
[6] Kim Hammar and Rolf Stadler. Learning Intrusion Prevention Policies through Optimal Stopping. 2021. arXiv: 2106.07160 [cs.AI].