Self-Learning Systems for Cyber Security
Ledningsregementet Enköping
Kim Hammar & Rolf Stadler
kimham@kth.se & stadler@kth.se
Division of Network and Systems Engineering
KTH Royal Institute of Technology
August 18, 2021
Approach: Game Model & Reinforcement Learning

- Challenges:
  - Evolving & automated attacks
  - Complex infrastructures
- Our Goal:
  - Automate security tasks
  - Adapt to changing attack methods
- Our Approach:
  - Model network attack and defense as games.
  - Use reinforcement learning to learn policies.
  - Incorporate learned policies in self-learning systems.

[Figure: IT infrastructure with an attacker, clients, a defender, and a gateway/IDS (producing alerts) in front of application servers 1-31.]
State of the Art

- Game-Learning Programs:
  - TD-Gammon, AlphaGo Zero [1], OpenAI Five, etc.
  - ⇒ Impressive empirical results of RL and self-play
- Attack Simulations:
  - Automated threat modeling [2], intrusion detection, etc.
  - ⇒ Need for automation and better security tooling
- Mathematical Modeling:
  - Game theory [3]
  - Markov decision theory, dynamic programming [4]
  - ⇒ Many security operations involve strategic decision making

[1] David Silver et al. "Mastering the game of Go without human knowledge". In: Nature 550 (Oct. 2017), pp. 354–. URL: http://dx.doi.org/10.1038/nature24270.
[2] Pontus Johnson, Robert Lagerström, and Mathias Ekstedt. "A Meta Language for Threat Modeling and Attack Simulations". In: Proceedings of the 13th International Conference on Availability, Reliability and Security (ARES 2018). Hamburg, Germany: Association for Computing Machinery, 2018. ISBN: 9781450364485. DOI: 10.1145/3230833.3232799.
[3] Tansu Alpcan and Tamer Basar. Network Security: A Decision and Game-Theoretic Approach. 1st ed. Cambridge University Press, 2010. ISBN: 0521119324.
[4] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. 3rd ed. Vol. I. Belmont, MA, USA: Athena Scientific, 2005.
Our Work

- Use Case: Intrusion Prevention
- Our Method:
  - Emulating computer infrastructures
  - System identification and model creation
  - Reinforcement learning and generalization
- Results:
  - Learning to Capture The Flag
  - Learning to Prevent Attacks (Optimal Stopping)
- Conclusions and Future Work
Use Case: Intrusion Prevention

- A Defender owns an infrastructure
  - Consists of connected components
  - Components run network services
  - Defender defends the infrastructure by monitoring and active defense
- An Attacker seeks to intrude on the infrastructure
  - Has a partial view of the infrastructure
  - Wants to compromise specific components
  - Attacks by reconnaissance, exploitation and pivoting

[Figure: IT infrastructure with attacker, clients, defender, and a gateway/IDS in front of application servers 1-31.]
Our Method for Finding Effective Security Strategies

[Figure: Method overview. A real-world infrastructure is selectively replicated in an emulation system; model creation & system identification produce a simulation system; reinforcement learning & generalization yield a policy π; the policy is evaluated and the model re-estimated against the emulation, and the policy is finally implemented in the real infrastructure, closing the loop toward automation and self-learning systems. A high-level sketch of this loop follows below.]
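The following is a minimal, runnable Python sketch of how these steps could be chained. Every component is a trivial stand-in (stub); nothing here reflects the authors' actual framework or its APIs, only the control flow of the loop.

class StubEmulation:
    """Stand-in for the emulation system: returns toy (b, a, b') transition traces."""
    def collect_traces(self, policy):
        return [("b0", "a0", "b1"), ("b1", "a0", "b0")]

def identify_dynamics(traces):
    return {"transitions": traces}                  # placeholder dynamics model

def train_policy(model, prev_policy=None):
    return {"threshold": 0.8}                       # placeholder learned policy

def evaluate(emulation, policy):
    return 1.0                                      # placeholder evaluation reward

def find_security_policy(emulation, n_iterations=3):
    policy = None
    for _ in range(n_iterations):
        traces = emulation.collect_traces(policy)   # run episodes in the emulation system
        model = identify_dynamics(traces)           # system identification
        policy = train_policy(model, policy)        # reinforcement learning in simulation
        print("evaluation reward:", evaluate(emulation, policy))
    return policy                                   # candidate policy for the real infrastructure

find_security_policy(StubEmulation())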
Emulation System and Configuration Space Σ

[Figure: Configurations σi in the configuration space Σ map to emulated infrastructures, e.g. the subnets 172.18.4.0/24, 172.18.19.0/24 and 172.18.61.0/24, each behind a router R1.]

Emulation: a cluster of machines that runs a virtualized infrastructure which replicates important functionality of target systems.

- The set of virtualized configurations defines a configuration space Σ = ⟨A, O, S, U, T, V⟩.
- A specific emulation is based on a configuration σi ∈ Σ.
Emulation: Execution Times of Replicated Operations

[Figure: Distributions of action execution times (costs) in the emulation for infrastructures with |N| = 25, 50, 75, and 100 nodes; execution times range up to roughly 2000 s.]

- Fundamental issue: Computational methods for policy learning typically require samples on the order of 100k-10M.
- ⇒ Infeasible to optimize in the emulation system (a rough estimate follows below)
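For example, even at an optimistic 10 s per emulated action, collecting 10^6 samples sequentially would take on the order of 10^7 s, i.e. roughly 115 days of wall-clock time (an illustrative estimate assuming one action at a time; the measured costs above are often far larger).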
From Emulation to Simulation: System Identification

[Figure: The emulated network with nodes m1-m7 and per-node metric vectors [m_{i,1}, ..., m_{i,k}] is abstracted into a model, which defines a POMDP ⟨S, A, P, R, γ, O, Z⟩ with action, state and observation sequences a_t, s_t, o_t.]

- Abstract Model Based on Domain Knowledge: Models the set of controls, the objective function, and the features of the emulated network.
  - Defines the static parts of a POMDP model.
- Dynamics Model (P, Z) Identified using System Identification: Algorithm based on random walks and maximum-likelihood estimation:

  M̂(b' | b, a) ≜ n(b, a, b') / Σ_{j'} n(b, a, j')

  where n(b, a, b') counts the observed transitions from b to b' under action a. A count-based sketch of this estimator follows below.
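Here is a minimal Python sketch of the count-based maximum-likelihood estimate above. The trace format (a list of (b, a, b') transitions) and the toy example values are assumptions for illustration, not the actual emulation traces.

from collections import defaultdict

def estimate_dynamics(transitions):
    """Count-based MLE: M_hat(b' | b, a) = n(b, a, b') / sum_j n(b, a, j)."""
    counts = defaultdict(lambda: defaultdict(int))
    for b, a, b_next in transitions:
        counts[(b, a)][b_next] += 1
    m_hat = {}
    for (b, a), successors in counts.items():
        total = sum(successors.values())
        m_hat[(b, a)] = {b_next: n / total for b_next, n in successors.items()}
    return m_hat

# Toy usage: "observations" are e.g. discretized alert counts.
traces = [(0, "continue", 1), (0, "continue", 0), (0, "continue", 1), (1, "continue", 1)]
print(estimate_dynamics(traces)[(0, "continue")])   # approximately {1: 0.67, 0: 0.33}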
System Identification: Estimated Dynamics Model

[Figure: Estimated emulation dynamics. For each node (172.18.4.2, 172.18.4.3, 172.18.4.10, 172.18.4.21, 172.18.4.79) the model covers connections, failed logins, accounts, online users, logins and processes; for the IDS it covers alerts, alert priorities, severe alerts and warning alerts.]
System Identification: Estimated Dynamics Model

[Figure: Estimated transition distributions P[· | (b_i, a_i)] for node 172.18.4.2: new TCP/UDP connections, new failed login attempts, created user accounts, new logged-in users, new login events, and created processes.]
System Identification: Estimated Dynamics Model

[Figure: Estimated IDS dynamics P[· | (b_i, a_i)]: IDS alerts, severe IDS alerts, warning IDS alerts, and IDS alert priorities.]
Policy Optimization in the Simulation System using Reinforcement Learning

- Goal:
  - Approximate π* = arg max_π E[ Σ_{t=1}^T γ^{t-1} r_{t+1} ]
- Learning Algorithm:
  - Represent π by πθ
  - Define the objective J(θ) = E_{πθ}[ Σ_{t=1}^T γ^{t-1} r(s_t, a_t) ]
  - Maximize J(θ) by stochastic gradient ascent:

    ∇θ J(θ) = E_{πθ}[ ∇θ log πθ(a|s) (actor) · A^{πθ}(s, a) (critic) ]

    (a toy policy-gradient sketch follows after this slide)
- Domain-Specific Challenges:
  - Partial observability
  - Large state space
  - Large action space
  - Non-stationary environment due to the attacker
  - Generalization
- Finding Effective Security Strategies through Reinforcement Learning and Self-Play [a]
- Learning Intrusion Prevention Policies through Optimal Stopping [b]

[Figure: Agent-environment loop: the agent takes action a_t, the environment returns state s_{t+1} and reward r_{t+1}.]

[a] Kim Hammar and Rolf Stadler. "Finding Effective Security Strategies through Reinforcement Learning and Self-Play". In: International Conference on Network and Service Management (CNSM). Izmir, Turkey, Nov. 2020.
[b] Kim Hammar and Rolf Stadler. Learning Intrusion Prevention Policies through Optimal Stopping. 2021. arXiv: 2106.07160 [cs.AI].
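As a concrete illustration of the policy-gradient objective above, here is a minimal REINFORCE sketch in Python for a toy two-action problem. It is not the paper's agent: the environment is a made-up stand-in, and the Monte-Carlo return is used in place of the advantage A^{πθ}(s, a) that a critic would provide.

import math, random

random.seed(1)
theta = [0.0, 0.0]                      # one logit per action (state-independent toy policy)

def pi(a):
    z = [math.exp(t) for t in theta]
    return z[a] / sum(z)

def episode(T=10):
    """Toy environment: action 1 pays off with prob 0.8, action 0 with prob 0.3."""
    traj = []
    for _ in range(T):
        a = 0 if random.random() < pi(0) else 1
        r = 1.0 if random.random() < (0.8 if a == 1 else 0.3) else 0.0
        traj.append((a, r))
    return traj

def reinforce(episodes=2000, lr=0.05, gamma=1.0):
    for _ in range(episodes):
        traj, G, returns = episode(), 0.0, []
        for a, r in reversed(traj):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for (a, _), G_t in zip(traj, returns):
            for k in range(2):                      # softmax gradient: 1{k=a} - pi(k)
                theta[k] += lr * G_t * ((1.0 if k == a else 0.0) - pi(k))

reinforce()
print([round(pi(a), 2) for a in (0, 1)])            # the learned policy should favor action 1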
The Target Infrastructure

- Topology:
  - 30 Application Servers, 1 Gateway/IDS (Snort), 3 Clients, 1 Attacker, 1 Defender
- Services:
  - 31 SSH, 8 HTTP, 1 DNS, 1 Telnet, 2 FTP, 1 MongoDB, 2 SMTP, 2 Teamspeak 3, 22 SNMP, 12 IRC, 1 Elasticsearch, 12 NTP, 1 Samba, 19 PostgreSQL
- RCE Vulnerabilities:
  - 1 CVE-2010-0426, 1 CVE-2014-6271, 1 SQL Injection, 1 CVE-2015-3306, 1 CVE-2016-10033, 1 CVE-2015-5602, 1 CVE-2015-1427, 1 CVE-2017-7494
- 5 Brute-force vulnerabilities
- Operating Systems:
  - 23 Ubuntu 20, 1 Debian 9.2, 1 Debian Wheezy, 6 Debian Jessie, 1 Kali
- Traffic:
  - Client 1: HTTP, SSH, SNMP, ICMP
  - Client 2: IRC, PostgreSQL, SNMP
  - Client 3: FTP, DNS, Telnet

[Figure: Target infrastructure topology with attacker, clients, defender, and a gateway/IDS in front of application servers 1-31.]
The Attacker Model: Capture the Flag (CTF)

- The attacker has T time-steps to collect flags, with no prior knowledge
- It can connect to a gateway that exposes public-facing services in the infrastructure
- It has a pre-defined set (cardinality ~200) of network/shell commands available; each command has a cost
- To collect flags, it has to interleave reconnaissance and exploits
- Objective: collect all flags with minimum cost

[Figure: Target infrastructure topology (attacker, clients, defender, gateway/IDS, servers 1-31).]
The Formal Attacker Model: A Partially Observed MDP

- Model the infrastructure as a graph G = ⟨N, E⟩
- There are k flags at nodes C ⊆ N
- Each node N_i ∈ N has a node state s_i of m attributes
- Network state: s = {s_A, s_i | i ∈ N} ∈ R^{|N|·m + |N|}
- The attacker observes o_A ⊂ s (results of commands)
- Action space: A = {a^A_1, ..., a^A_k} (commands)
- For every (s, a) ∈ S × A there is a probability w^{A,(x)}_{i,j} of failure and a probability of detection φ(det(s_i) · n^{A,(x)}_{i,j})
- State transitions s → s' are decided by a discrete dynamical system s' = F(s, a) (a toy sketch follows below)

[Figure: (a) Emulated infrastructure; (b) graph model with nodes N_0, ..., N_7, node states S_0, ..., S_7, and routers R1, R2.]
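To give a feel for this model, here is a toy Python sketch of an attacker that spreads over a small graph, where each command can fail and can be detected. The topology, flag placement, and probabilities are made up; they do not correspond to the target infrastructure or to the failure and detection parameters above.

import random

random.seed(0)
edges = {0: [1, 2], 1: [3], 2: [4], 3: [], 4: []}   # toy graph G = (N, E)
flags = {3, 4}                                      # nodes holding flags
compromised = {0}                                   # the attacker starts at the gateway

def exploit(node, p_success=0.6, p_detect=0.1):
    """One attacker command: returns (succeeded, detected)."""
    return random.random() < p_success, random.random() < p_detect

frontier = [n for c in compromised for n in edges[c]]
steps, detected, captured = 0, False, set()
while frontier and not detected and captured != flags:
    node = frontier.pop(0)
    succeeded, detected = exploit(node)
    steps += 1
    if succeeded and node not in compromised:
        compromised.add(node)
        captured |= flags & {node}
        frontier += edges[node]
print(steps, detected, sorted(captured))            # cost, detection outcome, flags captured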
Learning to Capture the Flags: Training Attacker Policies

[Figure: Learning curves (training performance in simulation and evaluation performance in the emulation) of our proposed method: % flags captured per episode, P[detected], reward per episode, episode length in steps and in seconds, and # IDS alerts per episode, plotted against the number of policy updates for πθ in simulation, πθ in emulation, and an upper bound.]
Learning Security Policies through Optimal Stopping

[Figure: Target infrastructure and an episode timeline: the intrusion event occurs at some time-step between t = 1 and t = T; stopping times before the event are early stops, stopping times after it interrupt the intrusion.]

- Intrusion Prevention as an Optimal Stopping Problem:
  - The defender observes the infrastructure (IDS, log files, etc.).
  - An intrusion occurs at an unknown time.
  - The defender can "stop" the intrusion.
  - Stopping shuts down the service provided by the infrastructure.
  - ⇒ trade-off between two objectives: service and security
- Based on the observations, when is it optimal to stop?
A Partially Observed MDP Model for the Defender

- States:
  - Intrusion state i_t ∈ {0, 1}, terminal state ∅
- Observations:
  - Severe/warning IDS alerts (Δx, Δy) and login attempts Δz, distributed according to f_XYZ(Δx, Δy, Δz | i_t, I_t, t) (a toy belief-update sketch follows below)
- Actions:
  - "Stop" (S) and "Continue" (C)
- Rewards:
  - Reward for security and service; penalty for false alarms and intrusions
- Transition probabilities:
  - A Bernoulli process (Q_t)_{t=1}^T ~ Ber(p) defines the intrusion start time I_t ~ Ge(p)
- Objective and Horizon:
  - max E_{πθ}[ Σ_{t=1}^{T_∅} r(s_t, a_t) ], with horizon T_∅

[Figure: State machine over states 0, 1 and the terminal state ∅ (the intrusion moves 0 → 1; stopping or the intrusion ending leads to ∅); the reward function R^{a_t}_{s_t} over time for stop/continue before and after the intrusion starts; and the CDF of the intrusion start time I_t ~ Ge(p = 0.2).]
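As a rough illustration of how a defender could track b(1) = P[intrusion ongoing | observations] in this model, here is a minimal Bayes-filter sketch in Python. The Poisson-style alert distribution and all parameter values are made-up stand-ins, not the fitted emulation distributions f_XYZ.

import math

p = 0.2                                    # per-step probability that the intrusion starts

def obs_likelihood(o, intrusion):
    """Hypothetical P[o | intrusion state]: higher alert counts once the intrusion starts."""
    lam = 4.0 if intrusion else 1.0
    pmf = math.exp(-lam)
    for k in range(1, o + 1):              # Poisson pmf e^(-lam) * lam^o / o!, built iteratively
        pmf *= lam / k
    return pmf

def update_belief(b1, o):
    """One step of the Bayes filter for b(1) = P[s_t = 1 | o_1, ..., o_t]."""
    b1_pred = b1 + (1 - b1) * p             # the intrusion may start before the observation
    num = obs_likelihood(o, True) * b1_pred
    return num / (num + obs_likelihood(o, False) * (1 - b1_pred))

b1 = 0.0
for o in [0, 1, 0, 5, 7, 6]:                # toy per-step alert counts
    b1 = update_belief(b1, o)
    print(round(b1, 3))                     # the belief rises once alerts become frequent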
Threshold Property of the Optimal Defender Policy (1/4)

Theorem. The optimal policy π* is a threshold policy of the form:

  π*(b(1)) = S (stop) if b(1) ≥ α*, and C (continue) otherwise,

where α* is a unique threshold and b(1) = P[s_t = 1 | a_1, o_1, ..., a_{t-1}, o_t].

- To see this, consider the optimality condition (Bellman equation):

  π*(b(1)) = arg max_{a ∈ A} [ r(b(1), a) + Σ_{o ∈ O} P[o | b(1), a] · V*(b^a_o(1)) ]
Threshold Property of the Optimal Defender Policy (2/4)

- Using A = {S, C}, we derive:

  π*(b(1)) = arg max_{a ∈ A} [ r(b(1), S), r(b(1), C) + Σ_{o ∈ O} P[o | b(1), C] · V*(b^C_o(1)) ]

  where the first term (denoted ω) is the expected reward for stopping and the second term is the expected cumulative reward for continuing.
- Expanding the expressions and rearranging terms, we derive that it is optimal to stop if and only if:

  b(1) ≥ [ 110 + Σ_{o ∈ O} V*(b^C_o(1)) (p·Z(o, 1, C) + (1 − p)·Z(o, 0, C)) ]
        / [ 300 + Σ_{o ∈ O} V*(b^C_o(1)) (p·Z(o, 1, C) + (1 − p)·Z(o, 0, C) − Z(o, 1, C)) ]

  The right-hand side is the threshold α_{b(1)} (a toy numeric evaluation follows below).
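To make the expression concrete, here is a toy numeric evaluation of the threshold α_{b(1)} in Python. The two-observation model Z and the V* values are invented for illustration; only the structure of the formula (including the constants 110 and 300 from the derivation above) is taken from the slide.

p = 0.2
O = [0, 1]                                       # toy observation space
Z = {(0, 0): 0.8, (0, 1): 0.3,                   # Z[(o, s)] = P[o | s, C], hypothetical values
     (1, 0): 0.2, (1, 1): 0.7}
V = {0: 50.0, 1: 20.0}                           # hypothetical V*(b^C_o(1)) for each o

num = 110 + sum(V[o] * (p * Z[(o, 1)] + (1 - p) * Z[(o, 0)]) for o in O)
den = 300 + sum(V[o] * (p * Z[(o, 1)] + (1 - p) * Z[(o, 0)] - Z[(o, 1)]) for o in O)
alpha = num / den
print(round(alpha, 3))                           # it is optimal to stop when b(1) >= alpha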
Threshold Property of the Optimal Defender Policy (3/4)

- Thus π* is determined by the scalar thresholds α_{b(1)}:
  - it is optimal to stop if b(1) ≥ α_{b(1)}
- The stopping set is:

  S = { b(1) ∈ [0, 1] : b(1) ≥ α_{b(1)} }

- Since V*(b) is piecewise linear and convex [18], S is also convex [19] and has the form [α*, β*] where 0 ≤ α* ≤ β* ≤ 1.
- When b(1) = 1 it is optimal to take the stop action S:

  π*(1) = arg max [ 100, −90 + Σ_{o ∈ O} Z(o, 1, C) · V*(b^C_o(1)) ] = S

- This means that β* = 1.

[18] Edward J. Sondik. "The Optimal Control of Partially Observable Markov Processes Over the Infinite Horizon: Discounted Costs". In: Operations Research 26.2 (1978), pp. 282–304. ISSN: 0030364X, 15265463. URL: http://www.jstor.org/stable/169635.
[19] Vikram Krishnamurthy. Partially Observed Markov Decision Processes: From Filtering to Controlled Sensing. Cambridge University Press, 2016. DOI: 10.1017/CBO9781316471104.
Threshold Property of the Optimal Defender Policy (4/4)

- As the stopping set is S = [α*, 1] and b(1) ∈ [0, 1],
- we have that it is optimal to stop if b(1) ≥ α*.
- Hence, Theorem 1 follows (a minimal policy sketch follows below):

  π*(b(1)) = S (stop) if b(1) ≥ α*, and C (continue) otherwise.
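A minimal sketch of the resulting defender policy: stop as soon as the belief reaches a fixed threshold. The α* value and the belief trajectory are illustrative; in practice the belief would come from a filter such as the one sketched earlier.

alpha_star = 0.8                                 # illustrative threshold, not a learned value

def defender_policy(b1):
    """Theorem 1: stop (S) iff the belief that an intrusion is ongoing reaches alpha*."""
    return "S" if b1 >= alpha_star else "C"

beliefs = [0.05, 0.12, 0.31, 0.58, 0.84, 0.97]   # e.g. successive outputs of the Bayes filter
for t, b1 in enumerate(beliefs, start=1):
    print(t, defender_policy(b1))                # first stop at t = 5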
Static Attackers to Emulate Intrusions

Time-steps t            Actions
1 – I_t ~ Ge(0.2)       (Intrusion has not started)
I_t+1 – I_t+7           Recon; brute-force attacks (SSH, Telnet, FTP) on N2, N4, N10; login(N2, N4, N10); backdoor(N2, N4, N10); Recon
I_t+8 – I_t+11          CVE-2014-6271 on N17; SSH brute-force attack on N12; login(N17, N12); backdoor(N17, N12)
I_t+12 – I_t+16         CVE-2010-0426 exploit on N12; Recon; SQL injection on N18; login(N18); backdoor(N18)
I_t+17 – I_t+22         Recon; CVE-2015-1427 on N25; login(N25); Recon; CVE-2017-7494 exploit on N27; login(N27)

Table 1: Attacker actions to emulate an intrusion.
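For reference, here is one way the static attacker of Table 1 could be encoded, together with sampling of the intrusion start time I_t ~ Ge(0.2). The action strings are shorthand labels, not an executable attack framework.

import random

def sample_intrusion_start(p=0.2):
    """I_t ~ Ge(p): number of time-steps until the intrusion starts."""
    t = 1
    while random.random() > p:
        t += 1
    return t

# Offsets are relative to the sampled intrusion start I_t (see Table 1).
attack_schedule = [
    ((1, 7),   "recon, brute-force SSH/Telnet/FTP on N2,N4,N10, login, backdoor, recon"),
    ((8, 11),  "CVE-2014-6271 on N17, SSH brute-force on N12, login, backdoor"),
    ((12, 16), "CVE-2010-0426 on N12, recon, SQL injection on N18, login, backdoor"),
    ((17, 22), "recon, CVE-2015-1427 on N25, login, recon, CVE-2017-7494 on N27, login"),
]

I_t = sample_intrusion_start()
for (start, end), actions in attack_schedule:
    print(f"steps {I_t + start}-{I_t + end}: {actions}")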
Learning Security Policies through Optimal Stopping

[Figure: Learning curves of training defender policies against static attackers: reward per episode, episode length (steps), P[intrusion interrupted], P[early stopping], and duration of uninterrupted intrusion, plotted against the number of policy updates, for the learned πθ vs NoisyAttacker and StealthyAttacker, a t = 6 baseline, an (x + y) ≥ 1 baseline, and the upper bound π*.]
Threshold Properties of the Learned Policies

[Figure: Learned stopping probabilities: πθ(stop | x + y) as a function of the total number of alerts and πθ(stop | z) as a function of the number of login attempts, against the StealthyAttacker and the NoisyAttacker; surface plots of πθ(stop | x, y) over severe alerts x and warning alerts y show soft thresholds and regions where the policy never stops.]
Open Challenge: Self-Play between Attacker and Defender

[Figure: Learning curves of training the attacker and the defender simultaneously in self-play: % flags captured per game, P[detected], reward per game, game length (steps), P[early stopping], and # IDS alerts per game, for attacker πθA and defender πθD in simulation and emulation.]
Conclusions & Future Work

- Conclusions:
  - We develop a method to find effective strategies for intrusion prevention:
    (1) emulation system; (2) system identification; (3) simulation system; (4) reinforcement learning; and (5) domain randomization and generalization.
  - We show that self-learning can be successfully applied to network infrastructures.
    - Self-play reinforcement learning in a Markov security game
  - Key challenges: stable convergence, sample efficiency, complexity of emulations, large state and action spaces, theoretical understanding of optimal policies
- Our research plans:
  - Extending the theoretical model
  - Relaxing simplifying assumptions (e.g. multiple defender actions)
  - Evaluation on real-world infrastructures
References

- Finding Effective Security Strategies through Reinforcement Learning and Self-Play [20]
  - Preprint (open access): https://arxiv.org/abs/2009.08120
- Learning Intrusion Prevention Policies through Optimal Stopping [21]
  - Preprint (open access): https://arxiv.org/pdf/2106.07160.pdf

[20] Kim Hammar and Rolf Stadler. "Finding Effective Security Strategies through Reinforcement Learning and Self-Play". In: International Conference on Network and Service Management (CNSM). Izmir, Turkey, Nov. 2020.
[21] Kim Hammar and Rolf Stadler. Learning Intrusion Prevention Policies through Optimal Stopping. 2021. arXiv: 2106.07160 [cs.AI].
Intrusion Prevention through Optimal Stopping and Self-Play
 
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
 
Intrusion Prevention through Optimal Stopping.
Intrusion Prevention through Optimal Stopping.Intrusion Prevention through Optimal Stopping.
Intrusion Prevention through Optimal Stopping.
 
A Game Theoretic Analysis of Intrusion Detection in Access Control Systems - ...
A Game Theoretic Analysis of Intrusion Detection in Access Control Systems - ...A Game Theoretic Analysis of Intrusion Detection in Access Control Systems - ...
A Game Theoretic Analysis of Intrusion Detection in Access Control Systems - ...
 
Självlärande system för cybersäkerhet
Självlärande system för cybersäkerhetSjälvlärande system för cybersäkerhet
Självlärande system för cybersäkerhet
 
Learning Intrusion Prevention Policies Through Optimal Stopping
Learning Intrusion Prevention Policies Through Optimal StoppingLearning Intrusion Prevention Policies Through Optimal Stopping
Learning Intrusion Prevention Policies Through Optimal Stopping
 
MuZero - ML + Security Reading Group
MuZero - ML + Security Reading GroupMuZero - ML + Security Reading Group
MuZero - ML + Security Reading Group
 

Kürzlich hochgeladen

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Kürzlich hochgeladen (20)

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

Self-Learning Systems for Cyber Security

  • 10. 6/34 Our Work
    I Use Case: Intrusion Prevention
    I Our Method:
      I Emulating computer infrastructures
      I System identification and model creation
      I Reinforcement learning and generalization
    I Results:
      I Learning to Capture The Flag
      I Learning to Prevent Attacks (Optimal Stopping)
    I Conclusions and Future Work
  • 11. 7/34 Use Case: Intrusion Prevention
    I A Defender owns an infrastructure
      I Consists of connected components
      I Components run network services
      I Defender defends the infrastructure by monitoring and active defense
    I An Attacker seeks to intrude on the infrastructure
      I Has a partial view of the infrastructure
      I Wants to compromise specific components
      I Attacks by reconnaissance, exploitation and pivoting
    [Figure: target infrastructure topology with attacker, clients, defender, gateway/IDS and 31 internal nodes]
  • 12. 8/34 Our Method for Finding Effective Security Strategies
    [Figure: method pipeline. Real-world infrastructure, selective replication into the Emulation System; model creation & system identification produce the Simulation System; reinforcement learning & generalization yield a policy π; policy mapping and policy implementation feed π back into the emulation; policy evaluation & model estimation close the loop towards automation & self-learning systems]
  • 21. 9/34 Emulation System
    [Figure: configuration space Σ with configurations σi mapped to emulated infrastructures on subnets 172.18.4.0/24, 172.18.19.0/24, 172.18.61.0/24]
    Emulation: a cluster of machines that runs a virtualized infrastructure which replicates important functionality of target systems.
    I The set of virtualized configurations defines a configuration space Σ = ⟨A, O, S, U, T, V⟩.
    I A specific emulation is based on a configuration σi ∈ Σ.
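To make the notion of a configuration σi ∈ Σ concrete, here is a minimal sketch of what such a configuration could look like in code. The schema and the class names (NodeConfig, EmulationConfig) are our own illustrative assumptions, not the authors' configuration format; the subnet, services and CVE identifiers are taken from the infrastructure described later in the deck.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class NodeConfig:
    """One virtualized node in an emulated infrastructure (illustrative schema)."""
    ip: str
    os: str
    services: List[str] = field(default_factory=list)
    vulnerabilities: List[str] = field(default_factory=list)

@dataclass
class EmulationConfig:
    """A configuration sigma_i in the configuration space Sigma."""
    name: str
    subnet: str
    nodes: List[NodeConfig] = field(default_factory=list)

sigma_1 = EmulationConfig(
    name="ctf-level-1",
    subnet="172.18.4.0/24",
    nodes=[
        NodeConfig("172.18.4.2", "ubuntu-20", ["ssh", "http"], ["ssh-brute-force"]),
        NodeConfig("172.18.4.10", "debian-jessie", ["ftp", "telnet"], ["CVE-2015-3306"]),
    ],
)
print(len(sigma_1.nodes), "nodes in", sigma_1.name)
```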
  • 23. 10/34 Emulation: Execution Times of Replicated Operations
    [Figure: normalized frequency of action execution times (costs) in seconds for infrastructures with |N| = 25, 50, 75, 100 nodes]
    I Fundamental issue: computational methods for policy learning typically require samples on the order of 100k − 10M.
    I =⇒ Infeasible to optimize in the emulation system
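A back-of-the-envelope calculation (our own, using only the orders of magnitude on this slide) makes the infeasibility concrete:

```python
# Why policy learning cannot be run directly in the emulation system:
# many replicated operations take on the order of 10-2000 seconds to execute,
# while policy-learning methods need roughly 100k-10M samples.
seconds_per_action = 100          # illustrative mid-range execution time
samples_needed = 100_000          # low end of the required sample count
print(seconds_per_action * samples_needed / 86_400, "days of wall-clock time")  # ~115 days
```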
  • 25. 12/34 From Emulation to Simulation: System Identification
    [Figure: mapping from the emulated network (nodes m1 to m7 with feature vectors) to an abstract model and a POMDP model ⟨S, A, P, R, γ, O, Z⟩ with actions a1, a2, ..., states s1, s2, ..., and observations o1, o2, ...]
    I Abstract Model Based on Domain Knowledge: models the set of controls, the objective function, and the features of the emulated network.
      I Defines the static parts of a POMDP model.
    I Dynamics Model (P, Z) Identified using System Identification: algorithm based on random walks and maximum-likelihood estimation:
      M(b′ | b, a) ≜ n(b, a, b′) / Σ_{j′} n(b, a, j′)
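The count-based maximum-likelihood estimator on the slide above can be written in a few lines. The function name and the trace format (an iterable of logged (b, a, b′) tuples) are our assumptions; in the paper such traces are collected by random walks in the emulation system.

```python
from collections import Counter, defaultdict

def estimate_dynamics(transitions):
    """Count-based maximum-likelihood estimate of M(b' | b, a).

    `transitions` is an iterable of (b, a, b_next) tuples. Returns a nested
    dict such that model[(b, a)][b_next] = n(b, a, b_next) / sum_j n(b, a, j).
    """
    counts = defaultdict(Counter)
    for b, a, b_next in transitions:
        counts[(b, a)][b_next] += 1

    model = {}
    for (b, a), next_counts in counts.items():
        total = sum(next_counts.values())
        model[(b, a)] = {b_next: n / total for b_next, n in next_counts.items()}
    return model

# Toy usage: three logged transitions from belief-node "b0" under action "scan".
trace = [("b0", "scan", "b1"), ("b0", "scan", "b1"), ("b0", "scan", "b2")]
print(estimate_dynamics(trace)[("b0", "scan")])   # {'b1': 0.67, 'b2': 0.33} (approximately)
```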
  • 28–30. 13–15/34 System Identification: Estimated Dynamics Model
    [Figures: estimated emulation dynamics. Per-node distributions (e.g. for 172.18.4.2, 172.18.4.3, 172.18.4.10, 172.18.4.21, 172.18.4.79) of new TCP/UDP connections, failed login attempts, created user accounts, logged-in users, login events and created processes, and estimated IDS dynamics (IDS alerts, alert priorities, severe alerts, warning alerts), conditioned on (belief, action) pairs]
  • 32. 17/34 Policy Optimization in the Simulation System using Reinforcement Learning
    I Goal:
      I Approximate π∗ = arg max_π E[ Σ_{t=1}^T γ^{t−1} r_{t+1} ]
    I Learning Algorithm:
      I Represent π by π_θ
      I Define objective J(θ) = E_{π_θ}[ Σ_{t=1}^T γ^{t−1} r(s_t, a_t) ]
      I Maximize J(θ) by stochastic gradient ascent:
        ∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(a|s) · A^{π_θ}(s, a) ]   (actor: ∇_θ log π_θ; critic: A^{π_θ})
    I Domain-Specific Challenges:
      I Partial observability
      I Large state space
      I Large action space
      I Non-stationary environment due to attacker
      I Generalization
    [Figure: agent-environment loop with action a_t, next state s_{t+1}, reward r_{t+1}]
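The policy-gradient update on the slide above can be illustrated with a minimal REINFORCE-style sketch on a toy, fully observed stopping problem. The environment, rewards, and hyperparameters are our own illustrative assumptions and not the authors' training setup (which uses an actor-critic method on the POMDPs described in this deck); the constant per-episode baseline below is a crude stand-in for the learned critic A^{π_θ}.

```python
import numpy as np

rng = np.random.default_rng(0)
GAMMA, HORIZON, P_INTRUSION = 0.99, 20, 0.2

def step(state, action):
    """Toy fully observed stopping MDP: state 0 = normal, 1 = intrusion."""
    if action == 1:                        # stop: episode ends
        return None, (5.0 if state == 1 else -5.0)
    reward = 1.0 if state == 0 else -2.0   # continue: service reward vs. intrusion cost
    next_state = 1 if (state == 1 or rng.random() < P_INTRUSION) else 0
    return next_state, reward

def policy(theta, state):
    logits = theta[state]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def run_episode(theta):
    state, traj = 0, []
    for _ in range(HORIZON):
        action = rng.choice(2, p=policy(theta, state))
        next_state, reward = step(state, action)
        traj.append((state, action, reward))
        if next_state is None:
            break
        state = next_state
    return traj

def reinforce(episodes=3000, lr=0.05):
    theta = np.zeros((2, 2))               # tabular softmax policy: states x actions
    for _ in range(episodes):
        traj = run_episode(theta)
        G, returns = 0.0, []
        for (_, _, r) in reversed(traj):
            G = r + GAMMA * G
            returns.append(G)
        returns.reverse()
        baseline = np.mean(returns)         # constant baseline in place of a critic
        for (s, a, _), G in zip(traj, returns):
            grad_log = -policy(theta, s)
            grad_log[a] += 1.0              # gradient of log softmax w.r.t. the logits
            theta[s] += lr * (G - baseline) * grad_log
    return theta

theta = reinforce()
print("P[stop | normal]    =", round(policy(theta, 0)[1], 3))
print("P[stop | intrusion] =", round(policy(theta, 1)[1], 3))
```

With these rewards the learned policy should continue while the state is normal and stop once the intrusion state is reached.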
  • 38. 19/34 The Target Infrastructure
    I Topology: 30 application servers, 1 gateway/IDS (Snort), 3 clients, 1 attacker, 1 defender
    I Services: 31 SSH, 8 HTTP, 1 DNS, 1 Telnet, 2 FTP, 1 MongoDB, 2 SMTP, 2 Teamspeak 3, 22 SNMP, 12 IRC, 1 Elasticsearch, 12 NTP, 1 Samba, 19 PostgreSQL
    I RCE vulnerabilities: 1 CVE-2010-0426, 1 CVE-2014-6271, 1 SQL injection, 1 CVE-2015-3306, 1 CVE-2016-10033, 1 CVE-2015-5602, 1 CVE-2015-1427, 1 CVE-2017-7494; plus 5 brute-force vulnerabilities
    I Operating systems: 23 Ubuntu 20, 1 Debian 9.2, 1 Debian Wheezy, 6 Debian Jessie, 1 Kali
    I Traffic: Client 1: HTTP, SSH, SNMP, ICMP; Client 2: IRC, PostgreSQL, SNMP; Client 3: FTP, DNS, Telnet
    [Figure: target infrastructure topology]
  • 39. 20/34 The Attacker Model: Capture the Flag (CTF)
    I The attacker has T time-steps to collect flags, with no prior knowledge.
    I It can connect to a gateway that exposes public-facing services in the infrastructure.
    I It has a pre-defined set (cardinality ∼ 200) of network/shell commands available; each command has a cost.
    I To collect flags, it has to interleave reconnaissance and exploits.
    I Objective: collect all flags with minimum cost.
    [Figure: target infrastructure topology]
  • 40. 21/34 The Formal Attacker Model: A Partially Observed MDP
    I Model the infrastructure as a graph G = ⟨N, E⟩
    I There are k flags at nodes C ⊆ N
    I Each node N_i ∈ N has a node state s_i of m attributes
    I Network state s = {s_A, s_i | i ∈ N} ∈ R^{|N|×m+|N|}
    I The attacker observes o_A ⊂ s (results of commands)
    I Action space: A = {a_1^A, . . . , a_k^A}, where each a_i^A is a command
    I For every (s, a) ∈ S × A there is a probability w_{i,j}^{A,(x)} of failure and a probability of detection ϕ(det(s_i) · n_{i,j}^{A,(x)})
    I State transitions s → s′ are decided by a discrete dynamical system s′ = F(s, a)
    [Figure: (a) emulated infrastructure, (b) graph model with nodes N0 to N7 and node states S0 to S7]
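A minimal sketch of the graph model and of one step of the discrete dynamics s′ = F(s, a) is shown below. The dictionary-based node state, the action format (command, target), and all probabilities are our own illustrative assumptions; they only mirror the structure of the model (per-action failure probability and per-node detection probability), not the actual parameters.

```python
import random

random.seed(1)

# Graph model G = <N, E>: nodes carry a small attribute vector as node state.
nodes = {
    "N0": {"compromised": True,  "flag": False, "detect": 0.05},  # attacker foothold
    "N1": {"compromised": False, "flag": False, "detect": 0.10},
    "N2": {"compromised": False, "flag": True,  "detect": 0.20},
}
edges = [("N0", "N1"), ("N1", "N2")]

def step(state, action, p_fail=0.3):
    """One transition s' = F(s, a) for an attacker action (command, target).

    The exploit may fail with probability p_fail, and every executed command
    risks detection with the target's per-node detection probability.
    """
    command, target = action
    detected = random.random() < state[target]["detect"]
    if command == "exploit" and random.random() > p_fail:
        state[target]["compromised"] = True
    return state, detected

state, detected = step(nodes, ("exploit", "N1"))
print("N1 compromised:", state["N1"]["compromised"], "| detected:", detected)
```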
  • 42. 22/34 Learning to Capture the Flags: Training Attacker Policies
    [Figure: learning curves (training performance in simulation and evaluation performance in the emulation) of our proposed method; panels show % flags captured per episode, P[detected], reward per episode, episode length in steps and seconds, and # IDS alerts per episode versus # policy updates, for π_θ in simulation, π_θ in emulation, and an upper bound]
  • 43. 23/34 Learning Security Policies through Optimal Stopping
    [Figure: episode timeline from t = 1 to t = T with the intrusion event; stops before the intrusion are early stopping times, stops after it are stopping times that interrupt the intrusion]
    I Intrusion Prevention as Optimal Stopping Problem:
      I The defender observes the infrastructure (IDS, log files, etc.).
      I An intrusion occurs at an unknown time.
      I The defender can “stop” the intrusion.
      I Stopping shuts down the service provided by the infrastructure.
      I =⇒ trade-off two objectives: service and security
      I Based on the observations, when is it optimal to stop?
  • 51. 24/34 A Partially Observed MDP Model for the Defender
    I States: intrusion state i_t ∈ {0, 1}, terminal state ∅
    I Observations: severe/warning IDS alerts (∆x, ∆y), login attempts ∆z, with distribution f_{XYZ}(∆x, ∆y, ∆z | i_t, I_t, t)
    I Actions: “Stop” (S) and “Continue” (C)
    I Rewards: reward for security and service; penalty for false alarms and intrusions
    I Transition probabilities: a Bernoulli process (Q_t)_{t=1}^T ∼ Ber(p) defines the intrusion start time I_t ∼ Ge(p)
    I Objective and horizon: max E_{π_θ}[ Σ_{t=1}^{T_∅} r(s_t, a_t) ], with stochastic horizon T_∅
    [Figures: state machine over {0, 1, ∅} with continue/stop transitions; per-time-step reward for stop and continue before and after the intrusion starts; CDF of the intrusion start time I_t ∼ Ge(p = 0.2)]
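The defender acts on the belief b(1) = P[s_t = 1 | a_1, o_1, . . . , o_t] that an intrusion is ongoing. Below is a minimal sketch of the recursive belief update under the continue action; the Poisson observation model and its rates are illustrative assumptions standing in for the alert distribution f_XYZ estimated from the emulation.

```python
import math

def belief_update(b1, obs, p=0.2):
    """One-step Bayesian update of the belief that an intrusion is ongoing.

    b1  : current belief P[s_t = 1]
    obs : number of severe IDS alerts observed in the last time-step
    p   : per-step probability that an intrusion starts (geometric start time)
    """
    def likelihood(o, intrusion):
        rate = 20.0 if intrusion else 2.0   # illustrative alert rates
        return math.exp(-rate) * rate**o / math.factorial(o)

    pred = b1 + (1.0 - b1) * p              # prediction: an intrusion may start
    num = likelihood(obs, True) * pred      # correction: weigh by the observation
    den = num + likelihood(obs, False) * (1.0 - pred)
    return num / den

b = 0.0
for o in [1, 3, 2, 15, 22]:                 # a burst of alerts drives the belief towards 1
    b = belief_update(b, o)
    print(round(b, 3))
```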
  • 58. 25/34 Threshold Property of the Optimal Defender Policy (1/4)
    Theorem. The optimal policy π∗ is a threshold policy of the form
      π∗(b(1)) = S (stop) if b(1) ≥ α∗, C (continue) otherwise,
    where α∗ is a unique threshold and b(1) = P[s_t = 1 | a_1, o_1, . . . , a_{t−1}, o_t].
    I To see this, consider the optimality condition (Bellman equation):
      π∗(b(1)) = arg max_{a∈A} [ r(b(1), a) + Σ_{o∈O} P[o | b(1), a] V∗(b_o^a(1)) ]
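The threshold structure asserted by the theorem can also be checked numerically. The sketch below runs value iteration on a discretized belief space for a one-dimensional stopping POMDP; the rewards, discount factor, and Poisson observation model are illustrative assumptions (the paper derives the result analytically and estimates the actual model from the emulation). With these numbers, the stop action should become optimal above a single belief value, an empirical counterpart of α∗.

```python
import math
import numpy as np

P_START, GAMMA = 0.2, 0.95
R_CONT_OK, R_CONT_INTR = 1.0, -10.0       # service reward vs. cost of an ongoing intrusion
R_STOP_INTR, R_STOP_OK = 100.0, -100.0    # interrupting an intrusion vs. a false alarm
OBS = np.arange(0, 41)                    # number of severe IDS alerts per time-step

def poisson(rate):
    return np.array([math.exp(-rate) * rate**o / math.factorial(o) for o in OBS])

LIK = {0: poisson(2.0), 1: poisson(20.0)}  # alert distribution without / with intrusion
GRID = np.linspace(0.0, 1.0, 101)

def q_values(b, V):
    """Q(b, stop) and Q(b, continue) for belief b(1) = b, given the value table V."""
    stop_q = b * R_STOP_INTR + (1 - b) * R_STOP_OK
    pred = b + (1 - b) * P_START                       # an intrusion may start this step
    cont_q = b * R_CONT_INTR + (1 - b) * R_CONT_OK
    p_obs = LIK[1] * pred + LIK[0] * (1 - pred)        # marginal observation probabilities
    b_next = LIK[1] * pred / np.maximum(p_obs, 1e-12)  # Bayesian belief update per observation
    idx = np.rint(b_next * (len(GRID) - 1)).astype(int)
    cont_q += GAMMA * np.sum(p_obs * V[idx])
    return stop_q, cont_q

V = np.zeros(len(GRID))
for _ in range(150):                                   # value iteration on the belief grid
    V = np.array([max(*q_values(b, V)) for b in GRID])

stop_region = [b for b in GRID if q_values(b, V)[0] >= q_values(b, V)[1]]
print("stopping is optimal for b(1) >=", round(stop_region[0], 2) if stop_region else None)
```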
  • 60. 26/34 Threshold Property of the Optimal Defender Policy (2/4)
    I We use A = {S, C} and derive:
      π∗(b(1)) = arg max_{a∈A} [ r(b(1), S),  r(b(1), C) + Σ_{o∈O} P[o | b(1), C] V∗(b_o^C(1)) ],
      where the first term (ω) is the expected reward for stopping and the second term is the expected cumulative reward for continuing.
    I Expanding the expressions and rearranging terms, we derive that it is optimal to stop iff:
      b(1) ≥ [ 110 + Σ_{o∈O} V∗(b_o^C(1)) (p Z(o, 1, C) + (1 − p) Z(o, 0, C)) ] / [ 300 + Σ_{o∈O} V∗(b_o^C(1)) (p Z(o, 1, C) + (1 − p) Z(o, 0, C) − Z(o, 1, C)) ]
      where the right-hand side is the threshold α_{b(1)}.
  • 65. 27/34 Threshold Property of the Optimal Defender Policy (3/4)
    I Thus π∗ is determined by the scalar thresholds α_{b(1)}: it is optimal to stop if b(1) ≥ α_{b(1)}
    I The stopping set is S = { b(1) ∈ [0, 1] : b(1) ≥ α_{b(1)} }
    I Since V∗(b) is piecewise linear and convex [16], S is also convex [17] and has the form [α∗, β∗] where 0 ≤ α∗ ≤ β∗ ≤ 1.
    I When b(1) = 1 it is optimal to take the stop action S:
      π∗(1) = arg max [ 100,  −90 + Σ_{o∈O} Z(o, 1, C) V∗(b_o^C(1)) ] = S
    I This means that β∗ = 1.
    [16] Edward J. Sondik. “The Optimal Control of Partially Observable Markov Processes Over the Infinite Horizon: Discounted Costs”. In: Operations Research 26.2 (1978), pp. 282–304. url: http://www.jstor.org/stable/169635.
    [17] Vikram Krishnamurthy. Partially Observed Markov Decision Processes: From Filtering to Controlled Sensing. Cambridge University Press, 2016. doi: 10.1017/CBO9781316471104.
  • 67. 28/34 Threshold Property of the Optimal Defender Policy (4/4)
    I As the stopping set is S = [α∗, 1] and b(1) ∈ [0, 1],
    I we have that it is optimal to stop if b(1) ≥ α∗.
    I Hence, Theorem 1 follows:
      π∗(b(1)) = S (stop) if b(1) ≥ α∗, C (continue) otherwise.
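As code, the resulting defender policy is a one-line rule over the belief; the value 0.8 below is a placeholder, not the α∗ of the model.

```python
def threshold_policy(b1, alpha=0.8):
    """Theorem 1 as code: stop iff the belief that an intrusion is ongoing reaches alpha.

    alpha = 0.8 is an illustrative placeholder; Theorem 1 only asserts that a
    unique optimal threshold alpha* exists.
    """
    return "S" if b1 >= alpha else "C"

for b1 in [0.05, 0.4, 0.79, 0.8, 0.95]:
    print(f"b(1) = {b1:.2f} -> {threshold_policy(b1)}")
```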
  • 69. 29/34 Static Attackers to Emulate Intrusions
    Table 1: Attacker actions to emulate an intrusion.
    Time-steps t              | Actions
    1 – I_t ∼ Ge(0.2)         | (Intrusion has not started)
    I_t + 1 – I_t + 7         | Recon, brute-force attacks (SSH, Telnet, FTP) on N2, N4, N10; login(N2, N4, N10); backdoor(N2, N4, N10); Recon
    I_t + 8 – I_t + 11        | CVE-2014-6271 on N17, SSH brute-force attack on N12; login(N17, N12); backdoor(N17, N12)
    I_t + 12 – I_t + 16       | CVE-2010-0426 exploit on N12, Recon, SQL injection on N18; login(N18); backdoor(N18)
    I_t + 17 – I_t + 22       | Recon, CVE-2015-1427 on N25, login(N25); Recon, CVE-2017-7494 exploit on N27, login(N27)
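The static attacker of Table 1 can be represented as a scripted, time-indexed action sequence relative to the sampled intrusion start time I_t. The sketch below is our own representation (function and variable names are assumptions) and compresses each phase of Table 1 into a single description string.

```python
import random

def sample_intrusion_start(p=0.2):
    """Intrusion start time I_t ~ Ge(p)."""
    t = 1
    while random.random() > p:
        t += 1
    return t

# Phases of Table 1, expressed as offsets from the intrusion start time.
ATTACK_SCRIPT = [
    (range(1, 8),   "recon + brute-force (SSH/Telnet/FTP) on N2, N4, N10; login; backdoor"),
    (range(8, 12),  "CVE-2014-6271 on N17; SSH brute-force on N12; login; backdoor"),
    (range(12, 17), "CVE-2010-0426 on N12; recon; SQL injection on N18; login; backdoor"),
    (range(17, 23), "recon; CVE-2015-1427 on N25; login; recon; CVE-2017-7494 on N27; login"),
]

def attacker_action(t, intrusion_start):
    """Return the scripted action for time-step t, or None before the intrusion starts."""
    offset = t - intrusion_start
    for phase, action in ATTACK_SCRIPT:
        if offset in phase:
            return action
    return None

start = sample_intrusion_start()
print("intrusion starts at t =", start)
print("action at t =", start + 1, ":", attacker_action(start + 1, start))
```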
  • 70. 30/34 Learning Security Policies through Optimal Stopping
    [Figure: learning curves of training defender policies against static attackers; panels show reward per episode, episode length (steps), P[intrusion interrupted], P[early stopping], and the duration of uninterrupted intrusions versus # policy updates, for the learned π_θ against NoisyAttacker and StealthyAttacker, a t = 6 baseline, an (x + y) ≥ 1 baseline, and the upper bound π∗]
  • 71. 31/34 Threshold Properties of the Learned Policies
    [Figure: learned stopping probabilities; π_θ(stop | x + y) as a function of the total number of alerts, π_θ(stop | z) as a function of login attempts, and heatmaps of π_θ(stop | x, y) over severe alerts x and warning alerts y, against StealthyAttacker and NoisyAttacker; the learned policies exhibit soft thresholds]
  • 72. 32/34 Open Challenge: Self-Play between Attacker and Defender
    [Figure: learning curves of training the attacker and the defender simultaneously in self-play; panels show % flags captured per game, P[detected], reward per game, game length (steps), P[early stopping], and # IDS alerts per game versus # policy updates, for attacker π_θA and defender π_θD in simulation and in emulation]
  • 73. 33/34 Conclusions & Future Work
    I Conclusions:
      I We develop a method to find effective strategies for intrusion prevention:
        (1) emulation system; (2) system identification; (3) simulation system; (4) reinforcement learning; and (5) domain randomization and generalization.
      I We show that self-learning can be successfully applied to network infrastructures.
      I Self-play reinforcement learning in a Markov security game
      I Key challenges: stable convergence, sample efficiency, complexity of emulations, large state and action spaces, theoretical understanding of optimal policies
    I Our research plans:
      I Extending the theoretical model
      I Relaxing simplifying assumptions (e.g. multiple defender actions)
      I Evaluation on real-world infrastructures
  • 74. 34/34 References
    I Finding Effective Security Strategies through Reinforcement Learning and Self-Play [20]
      I Preprint, open access: https://arxiv.org/abs/2009.08120
    I Learning Intrusion Prevention Policies through Optimal Stopping [21]
      I Preprint, open access: https://arxiv.org/pdf/2106.07160.pdf
    [20] Kim Hammar and Rolf Stadler. “Finding Effective Security Strategies through Reinforcement Learning and Self-Play”. In: International Conference on Network and Service Management (CNSM). Izmir, Turkey, Nov. 2020.
    [21] Kim Hammar and Rolf Stadler. Learning Intrusion Prevention Policies through Optimal Stopping. 2021. arXiv: 2106.07160 [cs.AI].