Self-Learning Systems for Cyber Security
Ledningsregementet Enköping
Kim Hammar & Rolf Stadler
kimham@kth.se & stadler@kth.se
Division of Network and Systems Engineering
KTH Royal Institute of Technology
August 18, 2021
Goal: Automation and Learning
- Challenges:
  - Evolving & automated attacks
  - Complex infrastructures
- Our Goal:
  - Automate security tasks
  - Adapt to changing attack methods
[Figure: IT infrastructure with an attacker, external clients, a defender, and a gateway with an IDS that raises alerts, in front of 31 networked application servers.]
Approach: Game Model & Reinforcement Learning
- Challenges:
  - Evolving & automated attacks
  - Complex infrastructures
- Our Goal:
  - Automate security tasks
  - Adapt to changing attack methods
- Our Approach:
  - Model network attack and defense as games.
  - Use reinforcement learning to learn policies.
  - Incorporate learned policies in self-learning systems.
State of the Art
- Game-Learning Programs:
  - TD-Gammon, AlphaGo Zero [1], OpenAI Five, etc.
  - ⇒ Impressive empirical results of RL and self-play
- Attack Simulations:
  - Automated threat modeling [2], intrusion detection, etc.
  - ⇒ Need for automation and better security tooling
- Mathematical Modeling:
  - Game theory [3]
  - Markov decision theory, dynamic programming [4]
  - ⇒ Many security operations involve strategic decision making

[1] David Silver et al. "Mastering the game of Go without human knowledge". In: Nature 550 (Oct. 2017), pp. 354–359. URL: http://dx.doi.org/10.1038/nature24270.
[2] Pontus Johnson, Robert Lagerström, and Mathias Ekstedt. "A Meta Language for Threat Modeling and Attack Simulations". In: Proceedings of the 13th International Conference on Availability, Reliability and Security (ARES 2018). Hamburg, Germany: ACM, 2018. DOI: 10.1145/3230833.3232799.
[3] Tansu Alpcan and Tamer Basar. Network Security: A Decision and Game-Theoretic Approach. 1st ed. Cambridge University Press, 2010.
[4] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. 3rd ed. Vol. I. Belmont, MA: Athena Scientific, 2005.
Our Work
- Use Case: Intrusion Prevention
- Our Method:
  - Emulating computer infrastructures
  - System identification and model creation
  - Reinforcement learning and generalization
- Results:
  - Learning to Capture The Flag
  - Learning to Prevent Attacks (Optimal Stopping)
- Conclusions and Future Work
Use Case: Intrusion Prevention
- A Defender owns an infrastructure
  - Consists of connected components
  - Components run network services
  - Defends the infrastructure by monitoring and active defense
- An Attacker seeks to intrude on the infrastructure
  - Has a partial view of the infrastructure
  - Wants to compromise specific components
  - Attacks by reconnaissance, exploitation and pivoting
Our Method for Finding Effective Security Strategies
[Figure: method pipeline. A real-world infrastructure is selectively replicated in an emulation system; model creation & system identification yield a simulation system; reinforcement learning & generalization produce a policy π, which is evaluated against the emulation (policy evaluation & model estimation) and finally mapped back to the real infrastructure (policy implementation, automation & self-learning systems).]
Emulation System and Configuration Space Σ
[Figure: configuration space Σ with sample configurations σi mapped to emulated infrastructures on subnets 172.18.4.0/24, 172.18.19.0/24 and 172.18.61.0/24.]
Emulation: a cluster of machines that runs a virtualized infrastructure which replicates important functionality of target systems.
- The set of virtualized configurations defines a configuration space Σ = ⟨A, O, S, U, T, V⟩.
- A specific emulation is based on a configuration σi ∈ Σ (a minimal configuration sketch follows below).
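The configuration space is easiest to see with a small data structure. Below is a minimal, hypothetical sketch of how one configuration σi ∈ Σ could be represented; the class and field names (EmulationConfig, NodeConfig, etc.) are illustrative assumptions, not the actual emulation framework.

```python
# Hypothetical sketch of a single configuration sigma_i in the configuration
# space Sigma = <A, O, S, U, T, V>. Class and field names are illustrative
# assumptions, not the actual emulation framework.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class NodeConfig:
    ip: str                      # address, e.g. in 172.18.4.0/24
    os: str                      # operating-system image
    services: List[str]          # running network services
    vulnerabilities: List[str]   # e.g. CVE identifiers

@dataclass
class EmulationConfig:
    """One configuration sigma_i: topology, vulnerabilities, flags and traffic."""
    subnet: str
    nodes: List[NodeConfig] = field(default_factory=list)
    flags: Dict[str, str] = field(default_factory=dict)                   # node ip -> flag path
    traffic_profiles: Dict[str, List[str]] = field(default_factory=dict)  # client -> protocols

# A tiny example configuration with one vulnerable application server
sigma_i = EmulationConfig(
    subnet="172.18.4.0/24",
    nodes=[NodeConfig(ip="172.18.4.2", os="ubuntu-20",
                      services=["ssh", "http"],
                      vulnerabilities=["CVE-2014-6271"])],
    flags={"172.18.4.2": "/root/flag.txt"},
    traffic_profiles={"client_1": ["http", "ssh", "snmp", "icmp"]},
)
```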
Emulation: Execution Times of Replicated Operations
[Figure: distributions of action execution times (costs) in the emulation for infrastructures of size |N| = 25, 50, 75 and 100; execution times range up to roughly 2000 seconds per action.]
- Fundamental issue: computational methods for policy learning typically require samples on the order of 100k–10M.
- ⇒ Infeasible to optimize in the emulation system (e.g. 10^6 actions at ~100 s each corresponds to more than three years of emulation time).
From Emulation to Simulation: System Identification
[Figure: the emulated network with nodes m1–m7 is abstracted into per-node feature vectors [m_{i,1}, ..., m_{i,k}], which define a POMDP model ⟨S, A, P, R, γ, O, Z⟩ with states s_t, actions a_t and observations o_t.]
- Abstract Model Based on Domain Knowledge: models the set of controls, the objective function, and the features of the emulated network.
  - Defines the static parts of the POMDP model ⟨S, A, P, R, γ, O, Z⟩.
- Dynamics Model (P, Z) Identified using System Identification: an algorithm based on random walks and maximum-likelihood estimation (a sketch follows below):

  M̂(b′ | b, a) ≜ n(b, a, b′) / Σ_{j′} n(b, a, j′)

  where n(b, a, b′) counts the observed transitions from b to b′ under action a.
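To make the estimator concrete, here is a minimal sketch (not the authors' implementation) of collecting transition counts with random walks in the emulation and normalizing them into the maximum-likelihood dynamics estimate M̂(b′ | b, a); the environment interface (reset, step, actions) is an assumption.

```python
# Minimal sketch (not the authors' implementation) of estimating the dynamics
# by maximum likelihood: M_hat(b' | b, a) = n(b, a, b') / sum_{j'} n(b, a, j'),
# with the counts n(.) collected by random walks in the emulation.
import random
from collections import defaultdict

def random_walk_counts(env, num_steps=10_000):
    """Collect transition counts n[(b, a)][b'] using uniformly random actions.
    `env` is assumed to expose reset(), step(a) -> next state/belief, and a
    list of actions; this interface is an assumption for the sketch."""
    counts = defaultdict(lambda: defaultdict(int))
    b = env.reset()
    for _ in range(num_steps):
        a = random.choice(env.actions)
        b_next = env.step(a)
        counts[(b, a)][b_next] += 1
        b = b_next
    return counts

def estimate_dynamics(counts):
    """Normalize the counts into conditional distributions M_hat(b' | b, a)."""
    M_hat = {}
    for (b, a), successors in counts.items():
        total = sum(successors.values())
        M_hat[(b, a)] = {b_next: n / total for b_next, n in successors.items()}
    return M_hat
```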
System Identification: Estimated Dynamics Model
[Figure: estimated conditional distributions P[·|(b_i, a_i)] for node 172.18.4.2: new TCP/UDP connections, new failed login attempts, created user accounts, new logged-in users, new login events and created processes.]
System Identification: Estimated Dynamics Model (cont.)
[Figure: estimated IDS dynamics P[·|(b_i, a_i)]: number of IDS alerts, severe IDS alerts, warning IDS alerts and IDS alert priorities.]
Policy Optimization in the Simulation System using Reinforcement Learning
- Goal:
  - Approximate π* = arg max_π E[Σ_{t=1}^{T} γ^{t−1} r_{t+1}]
- Learning Algorithm:
  - Represent π by πθ
  - Define the objective J(θ) = E_{πθ}[Σ_{t=1}^{T} γ^{t−1} r(s_t, a_t)]
  - Maximize J(θ) by stochastic gradient ascent (a minimal sketch follows this slide):
    ∇θ J(θ) = E_{πθ}[∇θ log πθ(a|s) · A^{πθ}(s, a)], where ∇θ log πθ(a|s) is the actor term and A^{πθ}(s, a) the critic (advantage) term
- Domain-Specific Challenges:
  - Partial observability
  - Large state space
  - Large action space
  - Non-stationary environment due to the attacker
  - Generalization
- Finding Effective Security Strategies through Reinforcement Learning and Self-Play [5]
- Learning Intrusion Prevention Policies through Optimal Stopping [6]
[Figure: agent-environment interaction loop: the agent takes action a_t; the environment returns state s_{t+1} and reward r_{t+1}.]

[5] Kim Hammar and Rolf Stadler. "Finding Effective Security Strategies through Reinforcement Learning and Self-Play". In: International Conference on Network and Service Management (CNSM). Izmir, Turkey, Nov. 2020.
[6] Kim Hammar and Rolf Stadler. Learning Intrusion Prevention Policies through Optimal Stopping. 2021. arXiv: 2106.07160 [cs.AI].
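As a concrete illustration of the gradient estimator above, the following is a minimal numpy sketch of a policy-gradient (REINFORCE-style) update for a linear-softmax policy, using the discounted return minus a baseline as the advantage estimate. The policy parameterization and episode format are assumptions made for the sketch; an actor-critic method would replace the Monte-Carlo advantage with a learned critic.

```python
# Minimal numpy sketch of the gradient estimator on the slide:
# grad_theta J(theta) = E[ grad_theta log pi_theta(a|s) * A(s, a) ].
# The linear-softmax policy and the episode format are illustrative assumptions.
import numpy as np

def policy(theta, s):
    """pi_theta(.|s): linear-softmax policy over discrete actions."""
    logits = theta @ s
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One stochastic-gradient-ascent step on J(theta) from a single episode,
    given as [(s_1, a_1, r_2), ..., (s_T, a_T, r_{T+1})]."""
    # Discounted returns G_t = sum_{k >= t} gamma^(k-t) r_{k+1}
    returns, G = [], 0.0
    for (_, _, r) in reversed(episode):
        G = r + gamma * G
        returns.insert(0, G)
    baseline = np.mean(returns)           # simple baseline for the advantage estimate
    for (s, a, _), G in zip(episode, returns):
        probs = policy(theta, s)
        grad_log = -np.outer(probs, s)    # d log pi / d theta for all action rows
        grad_log[a] += s                  # plus the indicator term for the taken action
        theta = theta + alpha * (G - baseline) * grad_log
    return theta
```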
The Target Infrastructure
- Topology:
  - 30 application servers, 1 gateway/IDS (Snort), 3 clients, 1 attacker, 1 defender
- Services:
  - 31 SSH, 8 HTTP, 1 DNS, 1 Telnet, 2 FTP, 1 MongoDB, 2 SMTP, 2 Teamspeak 3, 22 SNMP, 12 IRC, 1 Elasticsearch, 12 NTP, 1 Samba, 19 PostgreSQL
- RCE Vulnerabilities:
  - 1 CVE-2010-0426, 1 CVE-2014-6271, 1 SQL injection, 1 CVE-2015-3306, 1 CVE-2016-10033, 1 CVE-2015-5602, 1 CVE-2015-1427, 1 CVE-2017-7494
  - 5 brute-force vulnerabilities
- Operating Systems:
  - 23 Ubuntu 20, 1 Debian 9.2, 1 Debian Wheezy, 6 Debian Jessie, 1 Kali
- Traffic:
  - Client 1: HTTP, SSH, SNMP, ICMP
  - Client 2: IRC, PostgreSQL, SNMP
  - Client 3: FTP, DNS, Telnet
[Figure: the target infrastructure topology, as shown earlier.]
The Attacker Model: Capture the Flag (CTF)
- The attacker has T time-steps to collect flags, with no prior knowledge
- It can connect to a gateway that exposes public-facing services of the infrastructure
- It has a pre-defined set (cardinality ∼200) of network/shell commands available; each command has a cost
- To collect flags, it has to interleave reconnaissance and exploits
- Objective: collect all flags at minimal cost
The Formal Attacker Model: A Partially Observed MDP
- Model the infrastructure as a graph G = ⟨N, E⟩
- There are k flags at nodes C ⊆ N
- Each node N_i ∈ N has a node state s_i of m attributes
- Network state: s = {s_A, s_i | i ∈ N} ∈ ℝ^{|N|·m + |N|}
- The attacker observes o_A ⊂ s (results of commands)
- Action space: A = {a_1^A, ..., a_k^A} (commands)
- For every (s, a) ∈ S × A there is a probability w_{i,j}^{A,(x)} of failure and a probability of detection φ(det(s_i) · n_{i,j}^{A,(x)})
- State transitions s → s′ are decided by a discrete dynamical system s′ = F(s, a) (a minimal sketch follows the figure)
[Figure: (a) emulated infrastructure and (b) its graph model with nodes N0–N7, node state vectors S_0–S_7 and routers R1, R2.]
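To illustrate the graph model, here is a minimal sketch of a network state as per-node attribute vectors together with a deterministic transition function s′ = F(s, a) for a successful command; the attribute names and command set are illustrative assumptions, and the failure and detection probabilities are omitted.

```python
# Illustrative sketch (all names are assumptions) of the graph model G = <N, E>
# with per-node attribute vectors s_i and a deterministic transition s' = F(s, a)
# for a successful command; failure and detection probabilities are omitted.
from typing import Dict, Set, Tuple

NodeState = Tuple[int, int, int]        # (discovered, compromised, flag_found)
NetworkState = Dict[int, NodeState]     # node id -> node state s_i

edges: Set[Tuple[int, int]] = {(0, 1), (1, 2), (1, 3)}   # E: reachability between nodes

def F(s: NetworkState, action: Tuple[str, int]) -> NetworkState:
    """Apply an attacker command (cmd, target node) to the network state."""
    cmd, node = action
    s_next = dict(s)
    discovered, compromised, flag = s_next.get(node, (0, 0, 0))
    if cmd == "recon":
        s_next[node] = (1, compromised, flag)
    elif cmd == "exploit" and discovered:
        s_next[node] = (discovered, 1, flag)
    elif cmd == "search_flag" and compromised:
        s_next[node] = (discovered, compromised, 1)
    return s_next

# Example: recon then exploit node 1, starting from an empty state
s0: NetworkState = {}
s1 = F(F(s0, ("recon", 1)), ("exploit", 1))
```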
Learning Security Policies through Optimal Stopping
[Figure: the target infrastructure (as above) and an episode timeline from t = 1 to t = T: the intrusion starts at some intrusion event during the episode; stops before the event are early stopping times, stops after it are stopping times that interrupt the intrusion.]
- Intrusion Prevention as an Optimal Stopping Problem:
  - The defender observes the infrastructure (IDS, log files, etc.).
  - An intrusion occurs at an unknown time.
  - The defender can “stop” the intrusion.
  - Stopping shuts down the service provided by the infrastructure.
  - ⇒ trade off two objectives: service and security
  - Based on the observations, when is it optimal to stop?
A Partially Observed MDP Model for the Defender
- States:
  - Intrusion state i_t ∈ {0, 1}, terminal state ∅
- Observations:
  - Severe/warning IDS alerts (Δx, Δy), login attempts Δz, with observation distribution f_{XYZ}(Δx, Δy, Δz | i_t, I_t, t)
- Actions:
  - “Stop” (S) and “Continue” (C)
- Rewards:
  - Reward for security and service; penalty for false alarms and intrusions
- Transition probabilities:
  - A Bernoulli process (Q_t)_{t=1}^{T} ∼ Ber(p) defines the intrusion start time I_t ∼ Ge(p)
- Objective and Horizon:
  - max E_{πθ}[Σ_{t=1}^{T_∅} r(s_t, a_t)], with horizon T_∅, the time at which the terminal state ∅ is reached
(A belief-update sketch for this model follows the figure below.)
[Figure: state diagram 0 → 1 → ∅ (continue/stop transitions; the intrusion starts and ends), the reward function R^{a_t}_{s_t} over time-steps with separate stop and continue rewards, and the CDF of the intrusion start time I_t ∼ Ge(p = 0.2).]
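The belief b(1) that drives the defender's decision can be tracked with a standard Bayes filter. Below is a minimal sketch for the two-state intrusion POMDP under the "continue" action; the toy observation model Z is an assumption, whereas in practice the fitted distributions f_XYZ from the system identification would be used.

```python
# Minimal sketch of the Bayes filter for the belief b(1) = P[s_t = 1 | history]
# in the two-state intrusion POMDP (under the "continue" action). The toy
# observation model Z below is an assumption made for this sketch.
def belief_update(b1, o, Z, p=0.2):
    """One filtering step: predict with the per-step intrusion-start
    probability p, then correct with the observation likelihoods Z[s][o]."""
    pred1 = b1 + (1.0 - b1) * p                   # P[intrusion ongoing at t+1]
    num = Z[1].get(o, 0.0) * pred1
    den = num + Z[0].get(o, 0.0) * (1.0 - pred1)
    return num / den if den > 0 else pred1

# Toy observation model over coarse IDS-alert levels
Z = {0: {"few_alerts": 0.9, "many_alerts": 0.1},
     1: {"few_alerts": 0.3, "many_alerts": 0.7}}

b1 = 0.0
for o in ["few_alerts", "few_alerts", "many_alerts", "many_alerts"]:
    b1 = belief_update(b1, o, Z)
    print(f"b(1) = {b1:.3f}")
```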
Threshold Property of the Optimal Defender Policy (1/4)

Theorem. The optimal policy π* is a threshold policy of the form:

  π*(b(1)) = S (stop)     if b(1) ≥ α*
             C (continue) otherwise

where α* is a unique threshold and b(1) = P[s_t = 1 | a_1, o_1, ..., a_{t−1}, o_t].

- To see this, consider the optimality condition (Bellman equation):

  π*(b(1)) = arg max_{a∈A} [ r(b(1), a) + Σ_{o∈O} P[o | b(1), a] V*(b_o^a(1)) ]
Threshold Property of the Optimal Defender Policy (2/4)
- We use A = {S, C} and derive:

  π*(b(1)) = arg max { r(b(1), S),  r(b(1), C) + Σ_{o∈O} P[o | b(1), C] V*(b_o^C(1)) }

- The first term, ω = r(b(1), S), is the expected reward for stopping; the second term is the expected cumulative reward for continuing.
- Expanding the expressions and rearranging terms, we derive that it is optimal to stop iff:

  b(1) ≥ [ 110 + Σ_{o∈O} V*(b_o^C(1)) (p Z(o, 1, C) + (1 − p) Z(o, 0, C)) ]
         / [ 300 + Σ_{o∈O} V*(b_o^C(1)) (p Z(o, 1, C) + (1 − p) Z(o, 0, C) − Z(o, 1, C)) ]  ≜  α_{b(1)}
Threshold Property of the Optimal Defender Policy (3/4)
- Thus π* is determined by the scalar thresholds α_{b(1)}:
  - it is optimal to stop if b(1) ≥ α_{b(1)}
- The stopping set is: S = { b(1) ∈ [0, 1] : b(1) ≥ α_{b(1)} }
- Since V*(b) is piecewise linear and convex [7], S is also convex [8] and has the form [α*, β*], where 0 ≤ α* ≤ β* ≤ 1.
- When b(1) = 1 it is optimal to take the stop action S:

  π*(1) = arg max { 100,  −90 + Σ_{o∈O} Z(o, 1, C) V*(b_o^C(1)) } = S

- This means that β* = 1.

[7] Edward J. Sondik. "The Optimal Control of Partially Observable Markov Processes Over the Infinite Horizon: Discounted Costs". In: Operations Research 26.2 (1978), pp. 282–304. URL: http://www.jstor.org/stable/169635.
[8] Vikram Krishnamurthy. Partially Observed Markov Decision Processes: From Filtering to Controlled Sensing. Cambridge University Press, 2016. DOI: 10.1017/CBO9781316471104.
Threshold Property of the Optimal Defender Policy (4/4)
- As the stopping set is S = [α*, 1] and b(1) ∈ [0, 1],
- it is optimal to stop iff b(1) ≥ α*.
- Hence, Theorem 1 follows:

  π*(b(1)) = S (stop)     if b(1) ≥ α*
             C (continue) otherwise

(A minimal code sketch of this rule follows.)
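Combined with the belief update sketched earlier, Theorem 1 reduces the defender policy to a one-line rule. The threshold value below is purely illustrative; in the model α* is obtained from the optimality condition rather than chosen by hand.

```python
# The threshold policy of Theorem 1, driven by the belief b(1) maintained by
# the Bayes filter sketched earlier. alpha_star = 0.5 is only illustrative.
def threshold_policy(b1: float, alpha_star: float = 0.5) -> str:
    """Return 'S' (stop) if the belief reaches the threshold, else 'C' (continue)."""
    return "S" if b1 >= alpha_star else "C"
```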
Static Attackers to Emulate Intrusions

Time-steps t          Actions
1 – I_t ∼ Ge(0.2)     (Intrusion has not started)
I_t+1 – I_t+7         Recon; brute-force attacks (SSH, Telnet, FTP) on N2, N4, N10; login(N2, N4, N10); backdoor(N2, N4, N10); Recon
I_t+8 – I_t+11        CVE-2014-6271 on N17; SSH brute-force attack on N12; login(N17, N12); backdoor(N17, N12)
I_t+12 – I_t+16       CVE-2010-0426 exploit on N12; Recon; SQL injection on N18; login(N18); backdoor(N18)
I_t+17 – I_t+22       Recon; CVE-2015-1427 on N25; login(N25); Recon; CVE-2017-7494 exploit on N27; login(N27)

Table 1: Attacker actions to emulate an intrusion. (A sketch of how such a scripted attacker can be encoded follows.)
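One way to realize such a static attacker is as a fixed action sequence indexed relative to the sampled intrusion start time. The sketch below is a hypothetical encoding of Table 1; the action names merely mirror the table entries and are not a real attacker API.

```python
# Hypothetical encoding of the static attacker of Table 1 as a fixed action
# sequence executed after the sampled intrusion start time I_t ~ Ge(0.2).
# The action names mirror the table; they are not a real attacker API.
import numpy as np

INTRUSION_SEQUENCE = [
    # t = I_t+1 .. I_t+7
    "recon", "brute_force_ssh_N2", "brute_force_telnet_N4", "brute_force_ftp_N10",
    "login_N2_N4_N10", "backdoor_N2_N4_N10", "recon",
    # t = I_t+8 .. I_t+11
    "cve_2014_6271_N17", "ssh_brute_force_N12", "login_N17_N12", "backdoor_N17_N12",
    # t = I_t+12 .. I_t+16
    "cve_2010_0426_N12", "recon", "sql_injection_N18", "login_N18", "backdoor_N18",
    # t = I_t+17 .. I_t+22
    "recon", "cve_2015_1427_N25", "login_N25", "recon",
    "cve_2017_7494_N27", "login_N27",
]

def static_attacker_action(t: int, intrusion_start: int):
    """Return the attacker action at time-step t, or None before/after the intrusion."""
    step = t - intrusion_start - 1
    if 0 <= step < len(INTRUSION_SEQUENCE):
        return INTRUSION_SEQUENCE[step]
    return None

intrusion_start = int(np.random.geometric(p=0.2))   # sample I_t ~ Ge(0.2)
```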
Threshold Properties of the Learned Policies
[Figure: learned stopping probabilities πθ(stop | x + y) as a function of the total number of alerts, πθ(stop | z) as a function of login attempts, and πθ(stop | x, y) over severe alerts x and warning alerts y, evaluated against the StealthyAttacker and the NoisyAttacker; the surfaces exhibit soft thresholds and regions where the policy never stops.]
Open Challenge: Self-Play between Attacker and Defender
[Figure: learning curves over policy updates for the attacker πθA and the defender πθD in simulation and emulation: % flags captured per game, P[detected], reward per game, game length (steps), P[early stopping], and # IDS alerts per game.]
Learning curves of training the attacker and the defender simultaneously in self-play.
Conclusions & Future Work
- Conclusions:
  - We develop a method to find effective strategies for intrusion prevention:
    (1) emulation system; (2) system identification; (3) simulation system; (4) reinforcement learning; and (5) domain randomization and generalization.
  - We show that self-learning can be successfully applied to network infrastructures.
    - Self-play reinforcement learning in a Markov security game
  - Key challenges: stable convergence, sample efficiency, complexity of emulations, large state and action spaces, theoretical understanding of optimal policies
- Our research plans:
  - Extending the theoretical model
  - Relaxing simplifying assumptions (e.g. multiple defender actions)
  - Evaluation on real-world infrastructures
References
- Finding Effective Security Strategies through Reinforcement Learning and Self-Play [5]
  - Preprint (open access): https://arxiv.org/abs/2009.08120
- Learning Intrusion Prevention Policies through Optimal Stopping [6]
  - Preprint (open access): https://arxiv.org/pdf/2106.07160.pdf

[5] Kim Hammar and Rolf Stadler. "Finding Effective Security Strategies through Reinforcement Learning and Self-Play". In: International Conference on Network and Service Management (CNSM). Izmir, Turkey, Nov. 2020.
[6] Kim Hammar and Rolf Stadler. Learning Intrusion Prevention Policies through Optimal Stopping. 2021. arXiv: 2106.07160 [cs.AI].