1/28
Intrusion Prevention through
Optimal Stopping and Self-Play
NSE Seminar
Kim Hammar & Rolf Stadler
kimham@kth.se & stadler@kth.se
Division of Network and Systems Engineering
KTH Royal Institute of Technology
Mar 18, 2022
2/28
Use Case: Intrusion Prevention
I A Defender owns an infrastructure
I Consists of connected components
I Components run network services
I Defender defends the infrastructure
by monitoring and active defense
I Has partial observability
I An Attacker seeks to intrude on the
infrastructure
I Has a partial view of the
infrastructure
I Wants to compromise specific
components
I Attacks by reconnaissance,
exploitation and pivoting
[Figure: the target infrastructure — the attacker and the clients connect through a gateway where the defender receives IDS alerts; the infrastructure consists of connected components 1–31.]
3/28
The Intrusion Prevention Problem
[Figure: evolution of a game episode over time t — the defender's belief b, the number of IDS alerts, and the number of logins.]
When to take a defensive action?
Which action to take?
4/28
A Brief History of Intrusion Prevention
[Timeline figure:]
Quality control milestones (optimal stopping): Shewhart's control chart (1925), Wald's test (1945), CUSUM (Page, 1951), Shiryaev's test (1963)
Reference points: ARPANET (1969), Internet (1980s)
Intrusion prevention milestones: Manual detection/prevention (1980s), Audit logs and manual detection/prevention (1980s), Rule-based IDS/IPS (1990s), Rule-based & statistical IDS/IPS (2000s), Research: ML-based IDS, RL-based IPS, Control-based IPS, Computational game theory IPS (2010–present)
5/28
Our Approach
I Formal model:
I Controlled Hidden Markov Model
I Defender has partial observability
I A game if attacker is active
I Data collection:
I Emulated infrastructure
I Finding defender strategies:
I Self-play reinforcement learning
I Optimal stopping
[Figure: method overview — Target infrastructure, Selective Replication, Emulation system, Model Creation & System Identification, Reinforcement Learning, Optimal Stopping & Hidden Markov Model, Strategy Mapping π, Strategy Implementation π.]
6/28
Outline
I Use Case & Approach:
I Intrusion prevention
I Reinforcement learning and optimal stopping
I Formal Model of The Use Case
I Intrusion prevention as an optimal stopping problem
I Partially observed stochastic game
I Game Analysis and Structure of (π̃1, π̃2)
I Structural result: multi-threshold best responses
I Stopping sets S(1)_l are connected and nested
I Our Method
I Emulation of the target infrastructure
I System identification
I Our reinforcement learning algorithm
I Results & Conclusion
I Numerical evaluation results & Demo
I Conclusion & future work
7/28
Intrusion Prevention through Optimal Stopping
I The system evolves in discrete time-steps.
I A stop action = a defensive action (e.g. reconfigure the IPS).
[Figure: at each time-step the strategy π decides, based on the observation ot from the stochastic system, whether to stop (take a defensive action) or to continue monitoring.]
I The stopping time τl, where l is the number of stops remaining, is:
τl = inf{t : t > τl+1, at = S}, l ∈ {1, . . . , L}, τL+1 = 0
I τl is a random variable from the sample space Ω to N, which is
dependent on hτl = ρ1, a1, o1, . . . , aτl−1, oτl and independent of
aτl, oτl+1, . . .
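To make the indexing convention concrete, here is a minimal Python sketch (an illustration, not code from the presentation) that recovers the stopping times from a recorded action sequence, using the convention above that l counts the stops remaining and τL+1 = 0:

```python
def stopping_times(actions, L, stop_action="S"):
    """Recover the stopping times tau_L, ..., tau_1 from an action sequence.

    Convention from the slide: l counts the stops remaining, tau_{L+1} = 0,
    and tau_l = inf{t : t > tau_{l+1}, a_t = S}, so tau_L is the first stop
    in time and tau_1 is the last one.
    """
    taus = {L + 1: 0}
    l = L
    for t, a_t in enumerate(actions, start=1):  # time-steps t = 1, 2, ...
        if a_t == stop_action and l >= 1:
            taus[l] = t
            l -= 1
    return taus

# Example: three stops (L = 3) taken at t = 4, 9 and 15.
actions = ["C"] * 3 + ["S"] + ["C"] * 4 + ["S"] + ["C"] * 5 + ["S"]
assert stopping_times(actions, L=3) == {4: 0, 3: 4, 2: 9, 1: 15}
```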
8/28
Intrusion Prevention through Optimal Stopping
We consider the class of stopping times
Tt = {τl ≤ t | τl > τl+1} ∈ Ft (Ft = natural filtration on ht).
Given the observations:
[Figure: a time series of observations ot.]
Find the optimal stopping times τ∗L, τ∗L−1, τ∗L−2, . . .:
[Figure: the same time series annotated with the optimal stopping times τ∗L, τ∗L−1, τ∗L−2, . . .]
9/28
The Defender’s Stop Actions
[Figure: ingress IP packets to the infrastructure pass through the IPS; the defender π1 reads the history ht and controls the IPS drop rules.]
1. Ingress traffic goes through deep packet inspection at the gateway
2. Gateway runs the Snort IDS/IPS and may drop packets
3. The defender controls the IPS configuration
4. At each stopping time, we update the IPS configuration
10/28
The Defender’s Stop Actions
l Alarm class c Action
34 Attempted administrator privilege gain DROP
33 Attempted user privilege gain DROP
32 Inappropriate content was detected DROP
31 Potential corporate privacy violation DROP
30 Executable code was detected DROP
29 Successful administrator privilege gain DROP
28 Successful user privilege gain DROP
27 A network trojan was detected DROP
26 Unsuccessful user privilege gain DROP
25 Web application attack DROP
24 Attempted denial of service DROP
23 Attempted information leak DROP
22 Potentially Bad Traffic DROP
21 Attempt to login by a default username and password DROP
20 Detection of a denial of service attack DROP
. . .
Table 1: Defender stop actions in the emulation; l denotes the number of stops remaining.
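As an illustration of how such a mapping could be applied (the cumulative-rule assumption and the function below are not from the slides; the mapping is a hypothetical excerpt of Table 1):

```python
# First rows of Table 1: "stops remaining" l -> alarm class whose packets
# are dropped at that stop (hypothetical partial mapping for illustration).
STOP_ACTIONS = {
    34: "Attempted administrator privilege gain",
    33: "Attempted user privilege gain",
    32: "Inappropriate content was detected",
}

def apply_stop(installed_drops, l):
    """Return the set of dropped alarm classes after the stop taken with l
    stops remaining, assuming drop rules accumulate across successive stops."""
    new_class = STOP_ACTIONS.get(l)
    return installed_drops | {new_class} if new_class else installed_drops

# After the first two stops (l = 34, then l = 33) two classes are dropped.
drops = apply_stop(set(), 34)
drops = apply_stop(drops, 33)
assert len(drops) == 2
```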
11/28
Optimal Stopping Game
I We assume a strategic attacker
I Attacker knows which actions generate alarms
I Attacker tries to be stealthy
I Attacker may try to achieve denial of service
[Figure: the attacker π2 and the clients send IP packets to the infrastructure; the defender π1 reads the history ht and controls the IPS drop rules.]
12/28
Optimal Stopping Game
[Figure: timeline of a game episode from t = 1 showing attacker actions and defender actions; the episode ends when the intrusion is prevented.]
I The attacker's stopping times τ2,1, τ2,2, . . . determine the
times to start/stop the intrusion
I During the intrusion, the attacker follows a fixed intrusion
strategy
I The defender's stopping times τ1,L, τ1,L−1, . . . determine the
times to update the IPS configuration
I We seek a Nash equilibrium (π∗1, π∗2), from which we can
extract the optimal defender strategy π∗1 against a worst-case
attacker.
We model this game as a zero-sum partially observed stochastic game
13/28
Partially Observed Stochastic Game
I Players: N = {1, 2} (Defender = 1)
I States: Intrusion st ∈ {0, 1}, terminal ∅.
I Observations:
I IDS alerts ∆x1,t, ∆x2,t, . . . , ∆xM,t,
defender stops remaining lt ∈ {1, .., L},
fX(∆x1, . . . , ∆xM | st)
I Actions:
I A1 = A2 = {S, C}
I Rewards:
I Defender reward: security and service.
I Attacker reward: negative of defender reward.
I Transition probabilities:
I Follows from game dynamics.
I Horizon:
I ∞
[Figure: state transition diagram over the states 0, 1 and ∅; the transitions are triggered by the attacker's stop action a(2)_t = S and by the defender's final stop (a(1)_t = S when lt = 1).]
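For concreteness, the components above written out as a minimal Python sketch (M, L and the reward function are placeholders; this restates the list, it does not implement the game dynamics):

```python
from dataclasses import dataclass
from typing import Tuple

PLAYERS = (1, 2)                  # 1 = defender, 2 = attacker
STATES = (0, 1, "terminal")       # 0 = no intrusion, 1 = intrusion, terminal state
ACTIONS = ("S", "C")              # stop / continue, identical for both players

@dataclass
class DefenderObservation:
    alert_deltas: Tuple[int, ...]  # (dx_1, ..., dx_M): new IDS alerts per class
    stops_remaining: int           # l_t in {1, ..., L}

def attacker_reward(defender_reward: float) -> float:
    """Zero-sum game: the attacker's reward is the negative of the defender's."""
    return -defender_reward
```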
14/28
One-Sided Partial Observability
I We assume that the attacker has perfect information. Only
the defender has partial information.
I The attacker's view:
[Figure: Markov chain s1, s2, s3, . . . , sn with observations o1, o2, o3, . . . — the attacker observes the states.]
I The defender's view:
[Figure: the same chain — the defender only observes o1, o2, o3, . . .]
I Makes it tractable to compute the defender's belief
b(1)t,π2(st) = P[st | ht, π2] (avoids nested beliefs)
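The belief can be tracked with a standard HMM/POMDP filter. A minimal sketch for the two-state intrusion model, where the transition and observation models are placeholders (they follow from the game dynamics and from fX, which are not spelled out here):

```python
def belief_update(b1, obs, a1, transition_prob, obs_likelihood):
    """One step of the defender's belief update for s in {0, 1}.

    b1: current probability that an intrusion is ongoing (s_t = 1).
    transition_prob(s_next, s, a1): placeholder for P[s_next | s, a1, pi2].
    obs_likelihood(obs, s_next): placeholder for f_X(obs | s_next).
    """
    b = {0: 1.0 - b1, 1: b1}
    unnormalized = {
        s_next: obs_likelihood(obs, s_next)
        * sum(transition_prob(s_next, s, a1) * b[s] for s in (0, 1))
        for s_next in (0, 1)
    }
    z = unnormalized[0] + unnormalized[1]  # normalization constant
    return unnormalized[1] / z if z > 0 else b1
```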
15/28
Game Analysis
I Defender strategy is of the form: π1,l : B → ∆(A1)
I Attacker strategy is of the form: π2,l : S × B → ∆(A2)
I Defender and attacker objectives:
J1(π1,l, π2,l) = E(π1,l,π2,l)[ Σ_{t=1}^∞ γ^{t−1} Rl(st, at) ],  J2(π1,l, π2,l) = −J1
I Best response correspondences:
B1(π2,l) = arg max_{π1,l ∈ Π1} J1(π1,l, π2,l)
B2(π1,l) = arg max_{π2,l ∈ Π2} J2(π1,l, π2,l)
I Nash equilibrium (π∗1,l, π∗2,l):
π∗1,l ∈ B1(π∗2,l) and π∗2,l ∈ B2(π∗1,l) (1)
16/28
Structure of Best Response Strategies
[Figure, top: the defender's belief space b(1) ∈ [0, 1] with nested stopping sets S(1)_1 ⊆ S(1)_2 ⊆ . . . ⊆ S(1)_L given by the thresholds α̃1 ≥ α̃2 ≥ . . . ≥ α̃L.
Figure, bottom: the attacker's stopping sets S(2)_{0,1}, . . . , S(2)_{0,L} and S(2)_{1,1}, . . . , S(2)_{1,L} with the thresholds β̃0,1, . . . , β̃0,L and β̃1,1, . . . , β̃1,L.]
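As code, such multi-threshold best responses reduce to simple comparisons against the learned thresholds (a sketch; the comparison directions for the attacker are illustrative assumptions, not taken from the slides):

```python
def defender_threshold_strategy(belief, l, alpha):
    """Stop when the belief exceeds the threshold alpha~_l for l stops remaining."""
    return "S" if belief >= alpha[l] else "C"

def attacker_threshold_strategy(state, belief, l, beta):
    """The attacker's threshold beta~_{s,l} depends on the state s in {0, 1}
    and on the defender's stops remaining l; the direction of the comparison
    is an assumption made for illustration."""
    return "S" if belief >= beta[(state, l)] else "C"
```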
17/28
Our Method for Finding Effective Security Strategies
[Figure: method overview — Target Infrastructure, Selective Replication, Emulation System, Policy evaluation & Model estimation, Model Creation & System Identification, Simulation System, Reinforcement Learning & Generalization, Policy Mapping π, Policy Implementation π, Automation & Self-learning systems.]
18/28
Emulating the Target Infrastructure
I Emulate hosts with docker containers
I Emulate IDS and vulnerabilities with
software
I Network isolation and traffic shaping
through NetEm in the Linux kernel
I Enforce resource constraints using
cgroups.
I Emulate client arrivals with a
Poisson process
I Internal connections are full-duplex
& lossless with bit capacities of 1000
Mbit/s
I External connections are full-duplex
with bit capacities of 100 Mbit/s &
0.1% packet loss in normal operation
and random bursts of 1% packet loss
[Figure: the emulated infrastructure — attacker, clients, defender with IDS alerts at the gateway, and components 1–31.]
19/28
Running a Game Episode in the Emulation
I A distributed system with
synchronized clocks
I We run software sensors on all
emulated hosts
I Sensors produce messages to a
distributed queue (Kafka)
I A stream processor (Spark) consumes
messages from the queue and
computes statistics
I Actions are selected based on the
computed statistics and the strategies
I Actions are sent to the emulation
using gRPC
I Actions are executed by running
commands on the hosts
[Figure: the control loop — the strategies π1, π2 select actions at based on the belief bt and history ht computed from the observations ot of the emulated system state st.]
20/28
Our Reinforcement Learning Approach
I We learn a Nash equilibrium (π∗1,l,θ(1), π∗2,l,θ(2)) through
fictitious self-play.
I In each iteration:
1. Learn a best response strategy of the defender by solving a
POMDP: π̃1,l,θ(1) ∈ B1(π2,l,θ(2)).
2. Learn a best response strategy of the attacker by solving an
MDP: π̃2,l,θ(2) ∈ B2(π1,l,θ(1)).
3. Store the best response strategies in two buffers Θ1, Θ2
4. Update strategies to be the average of the stored strategies
[Figure: the self-play process — the strategy pair (π1, π2) is updated to (π′1, π′2), . . . and converges to (π∗1, π∗2).]
21/28
Fictitious Self-Play
Input
Γp: the POSG
Output
(π∗1,θ, π∗2,θ): an approximate Nash equilibrium
1: procedure ApproximateFP
2: θ(1) ∼ NL(−1, 1), θ(2) ∼ N2L(−1, 1)
3: Θ(1) ← {θ(1)}, Θ(2) ← {θ(2)}, δ̂ ← ∞
4: while δ̂ ≥ δ do
5: θ(1) ← ThresholdBR(Γp, π2,l,θ, N, a, c, λ, A, )
6: θ(2) ← ThresholdBR(Γp, π1,l,θ, N, a, c, λ, A, )
7: Θ(1) ← Θ(1) ∪ {θ(1)}, Θ(2) ← Θ(2) ∪ {θ(2)}
8: π1,l,θ ← MixtureDistribution(Θ(1))
9: π2,l,θ ← MixtureDistribution(Θ(2))
10: δ̂ ← Exploitability(π1,l,θ, π2,l,θ)
11: end while
12: return (π1,l,θ, π2,l,θ)
13: end procedure
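A runnable Python skeleton of this loop; ThresholdBR, the mixture averaging and the exploitability estimate are passed in as callables since their internals are given elsewhere in the deck (the skeleton is an illustration, not the authors' implementation):

```python
import random

def approximate_fp(threshold_br, mixture, exploitability, L, delta=0.1, max_iters=1000):
    """Fictitious self-play over threshold-parameterized strategies.

    threshold_br(opponent_strategy, player): learns a best-response threshold
        vector against the opponent's current average strategy.
    mixture(buffer): returns the average (mixture) strategy over a buffer.
    exploitability(pi1, pi2): estimates the distance to a Nash equilibrium.
    """
    theta1 = [random.gauss(-1, 1) for _ in range(L)]      # defender: L thresholds
    theta2 = [random.gauss(-1, 1) for _ in range(2 * L)]  # attacker: 2L thresholds
    buffer1, buffer2 = [theta1], [theta2]
    pi1, pi2 = mixture(buffer1), mixture(buffer2)
    for _ in range(max_iters):
        buffer1.append(threshold_br(pi2, player=1))  # defender best response (POMDP)
        buffer2.append(threshold_br(pi1, player=2))  # attacker best response (MDP)
        pi1, pi2 = mixture(buffer1), mixture(buffer2)
        if exploitability(pi1, pi2) < delta:
            break
    return pi1, pi2
```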
22/28
Our Reinforcement Learning Algorithm for Learning
Best-Response Threshold Strategies
I We use the structural result that threshold best response
strategies exist (Theorem 1) to design an efficient
reinforcement learning algorithm to learn best response
strategies.
I We seek to learn:
I L thresholds of the defender, α̃1 ≥ α̃2 ≥ . . . ≥ α̃L ∈ [0, 1]
I 2L thresholds of the attacker, β̃0,1, β̃1,1, . . . , β̃0,L, β̃1,L ∈ [0, 1]
I We learn these thresholds iteratively through Robbins and
Monro's stochastic approximation algorithm.1
[Figure: the learning loop — Monte-Carlo simulation produces a performance estimate Ĵ(θn); stochastic approximation (RL) updates θn to θn+1, converging from θ1 toward θ̂∗.]
1 Herbert Robbins and Sutton Monro. “A Stochastic Approximation Method”. In: The Annals of Mathematical Statistics 22.3 (1951), pp. 400–407. doi: 10.1214/aoms/1177729586.
23/28
Our Reinforcement Learning Algorithm
1. Parameterize the strategies π1,l,θ(1), π2,l,θ(2) by θ(1) ∈ R^L, θ(2) ∈ R^2L
2. The policy gradient
∇θ(i) J(θ(i)) = E_{πi,l,θ(i)}[ Σ_{t=1}^∞ ∇θ(i) log πi,l,θ(i)(a(i)_t | st) Σ_{τ=t}^∞ rτ ]
exists as long as πi,l,θ(i) is differentiable.
3. A pure threshold strategy is not differentiable.
4. To ensure differentiability and to constrain the thresholds to
be in [0, 1], we define πi,θ(i),l to be a smooth stochastic
strategy that approximates a threshold strategy:
πi,θ(i)(S | b(1)) = (1 + (b(1)(1 − σ(θ(i),j)) / (σ(θ(i),j)(1 − b(1))))^(−20))^(−1)
where σ(·) is the sigmoid function and σ(θ(i),j) is the
threshold.
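A numerical sketch of this smoothed threshold (only the exponent −20 and the functional form are taken from the slide; the stable log-space evaluation is an implementation choice):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def smooth_stop_probability(belief, theta_ij, k=20):
    """pi_{i,theta}(S | b(1)) from the slide, with sigma(theta_ij) as threshold.

    Evaluated as a logistic function of k*log(odds ratio), which equals
    (1 + (b(1-thr)/(thr(1-b)))^(-k))^(-1) while avoiding float overflow.
    As k grows, the function approaches a step at belief == sigmoid(theta_ij).
    """
    threshold = sigmoid(theta_ij)
    if belief <= 0.0:
        return 0.0
    if belief >= 1.0:
        return 1.0
    log_ratio = (math.log(belief) + math.log1p(-threshold)
                 - math.log(threshold) - math.log1p(-belief))
    z = k * log_ratio
    if z > 35:      # value is ~1; skip exp to avoid overflow
        return 1.0
    if z < -35:     # value is ~0
        return 0.0
    return 1.0 / (1.0 + math.exp(-z))

# With theta_ij = 0 the threshold is 0.5: smooth_stop_probability(0.4, 0.0)
# is close to 0 and smooth_stop_probability(0.6, 0.0) is close to 1.
```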
24/28
Smooth Threshold
[Figure: the smooth threshold function (1 + (b(1)(1−σ(θ(i),j)) / (σ(θ(i),j)(1−b(1))))^(−20))^(−1) plotted over the belief b(1) ∈ [0, 1], next to the step function it approximates.]
25/28
Our Reinforcement Learning Algorithm
1. We learn the thresholds through simulation.
2. For each iteration n ∈ {1, 2, . . .}, we perturb θ(i)_n to obtain
θ(i)_n + cn∆n and θ(i)_n − cn∆n.
3. Then, we run two MDP or POMDP episodes.
4. We then use the obtained episode outcomes Ĵi(θ(i)_n + cn∆n)
and Ĵi(θ(i)_n − cn∆n) to estimate ∇θ(i) Ji(θ(i)) using the
Simultaneous Perturbation Stochastic Approximation (SPSA)
gradient estimator4:
(∇̂θ(i)_n Ji(θ(i)_n))_k = (Ĵi(θ(i)_n + cn∆n) − Ĵi(θ(i)_n − cn∆n)) / (2cn(∆n)_k)
5. Next, we use the estimated gradient and update the vector of
thresholds through the stochastic approximation update:
θ(i)_{n+1} = θ(i)_n + an ∇̂θ(i)_n Ji(θ(i)_n)
4 James C. Spall. “Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation”. In: IEEE Transactions on Automatic Control 37.3 (1992), pp. 332–341.
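A minimal Python sketch of steps 2–5; the episode-based objective estimate estimate_J and the gain sequences a_n, c_n are placeholders (the decay exponents follow common SPSA practice, not the slides):

```python
import random

def spsa_step(theta, n, estimate_J, a0=0.1, c0=0.1):
    """One SPSA iteration: perturb theta, estimate the gradient, take an ascent step.

    theta: current threshold parameter vector theta^(i)_n (list of floats).
    estimate_J(theta): noisy objective estimate J^_i from one simulated episode.
    a0, c0 and the decay exponents are illustrative gain-sequence choices.
    """
    a_n = a0 / (n ** 0.602)
    c_n = c0 / (n ** 0.101)
    delta = [random.choice((-1.0, 1.0)) for _ in theta]              # Rademacher perturbations
    j_plus = estimate_J([t + c_n * d for t, d in zip(theta, delta)])
    j_minus = estimate_J([t - c_n * d for t, d in zip(theta, delta)])
    gradient = [(j_plus - j_minus) / (2.0 * c_n * d) for d in delta]  # (grad)_k from the slide
    return [t + a_n * g for t, g in zip(theta, gradient)]             # theta^(i)_{n+1}
```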
26/28
Simulation Results (Emulation Results TBD..)
[Figure, left: exploitability of the learned strategies vs. training time (min), with the learned strategy pair approaching the Nash equilibrium.
Figure, middle: convergence times (min) of ApproximateFP, NFSP and HSVI for OS-POSGs for observation spaces |O| = 10^2, 10^3, 10^4.
Figure, right: number of dropped packets (malign vs. benign) for the learned equilibrium defender strategy and two heuristics: drop all alerts, and drop alerts reactively.]
27/28
Demo - A System for Interactive Examination of Learned
Security Strategies
[Figure: architecture — a client web browser uses a JavaScript frontend for policy examination, served by a Flask HTTP server backed by PostgreSQL (traces, policy πθ), the emulation, and the simulation.]
Architecture of the system for examining learned security strategies.
28/28
Conclusions & Future Work
I Conclusions:
I We develop a method to automatically learn security policies:
(1) emulation system; (2) system identification; (3) simulation system; (4) reinforcement
learning; and (5) domain randomization and generalization.
I We apply the method to an intrusion prevention use case:
I We formulate intrusion prevention as a multiple stopping problem
I We present a partially observed stochastic game of the use case
I We present a POMDP model of the defender's problem
I We present an MDP model of the attacker's problem
I We apply optimal stopping theory to establish structural results of the best response strategies
I Our research plans:
I Run experiments in the emulation system
I Make the learned strategy available as a plugin to the Snort IDS
I Extend the model:
I Fewer restrictions on the defender
I Non-static infrastructures
I Scaling up the emulation system
Weitere ähnliche Inhalte

Ähnlich wie Intrusion Prevention through Optimal Stopping and Self-Play

Master Thesis Presentation
Master Thesis PresentationMaster Thesis Presentation
Master Thesis Presentation
Mohamed Sobh
 
Estimation of Reliability Indices of Two Component Identical System in the Pr...
Estimation of Reliability Indices of Two Component Identical System in the Pr...Estimation of Reliability Indices of Two Component Identical System in the Pr...
Estimation of Reliability Indices of Two Component Identical System in the Pr...
IJLT EMAS
 
Tlc 1223618496327769-9
Tlc 1223618496327769-9Tlc 1223618496327769-9
Tlc 1223618496327769-9
krunal yagnik
 
TESTING LIFE CYCLE PPT
TESTING LIFE CYCLE PPTTESTING LIFE CYCLE PPT
TESTING LIFE CYCLE PPT
suhasreddy1
 
20150603 establishing theoretical minimal sets of mutants (Lab seminar)
20150603 establishing theoretical minimal sets of mutants (Lab seminar)20150603 establishing theoretical minimal sets of mutants (Lab seminar)
20150603 establishing theoretical minimal sets of mutants (Lab seminar)
Donghwan Shin
 

Ähnlich wie Intrusion Prevention through Optimal Stopping and Self-Play (20)

Intrusion Prevention through Optimal Stopping.
Intrusion Prevention through Optimal Stopping.Intrusion Prevention through Optimal Stopping.
Intrusion Prevention through Optimal Stopping.
 
Learning Intrusion Prevention Policies through Optimal Stopping - CNSM2021
Learning Intrusion Prevention Policies through Optimal Stopping - CNSM2021Learning Intrusion Prevention Policies through Optimal Stopping - CNSM2021
Learning Intrusion Prevention Policies through Optimal Stopping - CNSM2021
 
Event tree analysis and risk assessment
Event tree analysis and risk assessmentEvent tree analysis and risk assessment
Event tree analysis and risk assessment
 
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
 
Self-Learning Systems for Cyber Security
Self-Learning Systems for Cyber SecuritySelf-Learning Systems for Cyber Security
Self-Learning Systems for Cyber Security
 
Self-Learning Systems for Cyber Security
Self-Learning Systems for Cyber SecuritySelf-Learning Systems for Cyber Security
Self-Learning Systems for Cyber Security
 
A Game Theoretic Analysis of Intrusion Detection in Access Control Systems - ...
A Game Theoretic Analysis of Intrusion Detection in Access Control Systems - ...A Game Theoretic Analysis of Intrusion Detection in Access Control Systems - ...
A Game Theoretic Analysis of Intrusion Detection in Access Control Systems - ...
 
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
 
Intrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal StoppingIntrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal Stopping
 
Security of Machine Learning
Security of Machine LearningSecurity of Machine Learning
Security of Machine Learning
 
Master Thesis Presentation
Master Thesis PresentationMaster Thesis Presentation
Master Thesis Presentation
 
reliability workshop
reliability workshopreliability workshop
reliability workshop
 
Estimation of Reliability Indices of Two Component Identical System in the Pr...
Estimation of Reliability Indices of Two Component Identical System in the Pr...Estimation of Reliability Indices of Two Component Identical System in the Pr...
Estimation of Reliability Indices of Two Component Identical System in the Pr...
 
Tlc 1223618496327769-9
Tlc 1223618496327769-9Tlc 1223618496327769-9
Tlc 1223618496327769-9
 
TESTING LIFE CYCLE PPT
TESTING LIFE CYCLE PPTTESTING LIFE CYCLE PPT
TESTING LIFE CYCLE PPT
 
EU ATT&CK community ATT&CK-onomics
EU ATT&CK community ATT&CK-onomicsEU ATT&CK community ATT&CK-onomics
EU ATT&CK community ATT&CK-onomics
 
Measuring credit risk in a large banking system: econometric modeling and emp...
Measuring credit risk in a large banking system: econometric modeling and emp...Measuring credit risk in a large banking system: econometric modeling and emp...
Measuring credit risk in a large banking system: econometric modeling and emp...
 
20150603 establishing theoretical minimal sets of mutants (Lab seminar)
20150603 establishing theoretical minimal sets of mutants (Lab seminar)20150603 establishing theoretical minimal sets of mutants (Lab seminar)
20150603 establishing theoretical minimal sets of mutants (Lab seminar)
 
Data-Driven Recommender Systems
Data-Driven Recommender SystemsData-Driven Recommender Systems
Data-Driven Recommender Systems
 
Petri Net Modelling of Physical Vulnerability
Petri Net Modelling of Physical VulnerabilityPetri Net Modelling of Physical Vulnerability
Petri Net Modelling of Physical Vulnerability
 

Mehr von Kim Hammar

Automated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jecturesAutomated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jectures
Kim Hammar
 
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Kim Hammar
 
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Kim Hammar
 
Learning Optimal Intrusion Responses via Decomposition
Learning Optimal Intrusion Responses via DecompositionLearning Optimal Intrusion Responses via Decomposition
Learning Optimal Intrusion Responses via Decomposition
Kim Hammar
 
MuZero - ML + Security Reading Group
MuZero - ML + Security Reading GroupMuZero - ML + Security Reading Group
MuZero - ML + Security Reading Group
Kim Hammar
 

Mehr von Kim Hammar (14)

Automated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jecturesAutomated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jectures
 
Självlärande System för Cybersäkerhet. KTH
Självlärande System för Cybersäkerhet. KTHSjälvlärande System för Cybersäkerhet. KTH
Självlärande System för Cybersäkerhet. KTH
 
Learning Automated Intrusion Response
Learning Automated Intrusion ResponseLearning Automated Intrusion Response
Learning Automated Intrusion Response
 
Intrusion Tolerance for Networked Systems through Two-level Feedback Control
Intrusion Tolerance for Networked Systems through Two-level Feedback ControlIntrusion Tolerance for Networked Systems through Two-level Feedback Control
Intrusion Tolerance for Networked Systems through Two-level Feedback Control
 
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
 
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
 
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
 
Learning Optimal Intrusion Responses via Decomposition
Learning Optimal Intrusion Responses via DecompositionLearning Optimal Intrusion Responses via Decomposition
Learning Optimal Intrusion Responses via Decomposition
 
Digital Twins for Security Automation
Digital Twins for Security AutomationDigital Twins for Security Automation
Digital Twins for Security Automation
 
Självlärande system för cyberförsvar.
Självlärande system för cyberförsvar.Självlärande system för cyberförsvar.
Självlärande system för cyberförsvar.
 
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
 
Reinforcement Learning Algorithms for Adaptive Cyber Defense against Heartbleed
Reinforcement Learning Algorithms for Adaptive Cyber Defense against HeartbleedReinforcement Learning Algorithms for Adaptive Cyber Defense against Heartbleed
Reinforcement Learning Algorithms for Adaptive Cyber Defense against Heartbleed
 
Självlärande system för cybersäkerhet
Självlärande system för cybersäkerhetSjälvlärande system för cybersäkerhet
Självlärande system för cybersäkerhet
 
MuZero - ML + Security Reading Group
MuZero - ML + Security Reading GroupMuZero - ML + Security Reading Group
MuZero - ML + Security Reading Group
 

Kürzlich hochgeladen

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Kürzlich hochgeladen (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 

Intrusion Prevention through Optimal Stopping and Self-Play

  • 1. 1/28 Intrusion Prevention through Optimal Stopping and Self-Play NSE Seminar Kim Hammar & Rolf Stadler kimham@kth.se & stadler@kth.se Division of Network and Systems Engineering KTH Royal Institute of Technology Mar 18, 2022
  • 2. 2/28 Use Case: Intrusion Prevention I A Defender owns an infrastructure I Consists of connected components I Components run network services I Defender defends the infrastructure by monitoring and active defense I Has partial observability I An Attacker seeks to intrude on the infrastructure I Has a partial view of the infrastructure I Wants to compromise specific components I Attacks by reconnaissance, exploitation and pivoting Attacker Clients . . . Defender 1 IDS 1 alerts Gateway 7 8 9 10 11 6 5 4 3 2 12 13 14 15 16 17 18 19 21 23 20 22 24 25 26 27 28 29 30 31
  • 3–11. 3/28 The Intrusion Prevention Problem [Slides 3–11 are an incremental reveal of the same figure: time-series plots over t of the defender's belief b, the number of IDS alerts, and the number of logins.] When to take a defensive action? Which action to take?
  • 12–18. 4/28 A Brief History of Intrusion Prevention [Slides 12–18 build up a timeline.] Reference points: ARPANET (1969), Internet (1980s). Intrusion prevention milestones: manual detection/prevention (1980s), audit logs and manual detection/prevention (1980s), rule-based IDS/IPS (1990s), rule-based & statistical IDS/IPS (2000s), Research: ML-based IDS, RL-based IPS, control-based IPS, computational game theory IPS (2010–present). Quality control milestones (optimal stopping): Shewhart’s control chart (1925), Wald’s test (1945), CUSUM (Page, 1951), Shiryaev’s test (1963).
  • 19. 5/28 Our Approach I Formal model: I Controlled Hidden Markov Model I Defender has partial observability I A game if attacker is active I Data collection: I Emulated infrastructure I Finding defender strategies: I Self-play reinforcement learning I Optimal stopping [Figure: loop connecting the target infrastructure, the emulation system (selective replication), model creation & system identification, reinforcement learning, optimal stopping & hidden Markov model, and strategy mapping and implementation π]
  • 20–24. 6/28 Outline (shown repeatedly as the talk progresses) I Use Case & Approach: I Intrusion prevention I Reinforcement learning and optimal stopping I Formal Model of The Use Case I Intrusion prevention as an optimal stopping problem I Partially observed stochastic game I Game Analysis and Structure of (π̃1, π̃2) I Structural result: multi-threshold best responses I Stopping sets S_l^(1) are connected and nested I Our Method I Emulation of the target infrastructure I System identification I Our reinforcement learning algorithm I Results & Conclusion I Numerical evaluation results & Demo I Conclusion & future work
  • 25–26. 7/28 Intrusion Prevention through Optimal Stopping I The system evolves in discrete time-steps. I A stop action = a defensive action (e.g. reconfigure the IPS). [Figure: a stochastic system emits observations o_t; the strategy π decides Stop?; if no, continue monitoring; if yes, take a defensive action.] I The L − lth stopping time τ_l is: τ_l = inf{t : t > τ_{l+1}, a_t = S}, l ∈ {1, .., L}, with τ_{L+1} = 0 I τ_l is a random variable from the sample space Ω to N that depends on h_{τ_l} = (ρ_1, a_1, o_1, . . . , a_{τ_l−1}, o_{τ_l}) and is independent of a_{τ_l}, o_{τ_l+1}, . . .
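To make the multiple-stopping notation above concrete, here is a minimal Python sketch (not part of the original deck; the function name and the 1-indexed time convention are illustrative) that extracts the stopping times τ_L, τ_{L−1}, . . . , τ_1 from a recorded action sequence, indexing each stop by the number of stops remaining.

```python
def stopping_times(actions, L, stop_action="S"):
    """Return {l: tau_l} where tau_l is the time of the stop taken with
    l stops remaining (tau_L is the first stop, tau_1 the last).
    Time steps are 1-indexed, matching the slides; unused stops are omitted."""
    taus = {}
    l = L  # stops remaining
    for t, a in enumerate(actions, start=1):
        if a == stop_action and l >= 1:
            taus[l] = t   # tau_l is the first stop after tau_{l+1}
            l -= 1
    return taus

# Example: with L = 3 stops, the defender stops at t = 4, 9, 15
actions = ["C", "C", "C", "S", "C", "C", "C", "C", "S",
           "C", "C", "C", "C", "C", "S"]
print(stopping_times(actions, L=3))   # {3: 4, 2: 9, 1: 15}
```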
  • 27–29. 8/28 Intrusion Prevention through Optimal Stopping We consider the class of stopping times T_t = {τ_l ≤ t | τ_l > τ_{l+1}} ∈ F_t (F_t = the natural filtration on h_t). [Figure: given the observations o_t over time t, find the optimal stopping times τ*_L, τ*_{L−1}, τ*_{L−2}, . . .]
  • 30. 9/28 The Defender’s Stop Actions [Figure: IP packets enter the infrastructure; the defender π1 reads the history h_t and controls the IPS drop rules.] 1. Ingress traffic goes through deep packet inspection at the gateway 2. The gateway runs the Snort IDS/IPS and may drop packets 3. The defender controls the IPS configuration 4. At each stopping time, we update the IPS configuration
  • 31. 10/28 The Defender’s Stop Actions
    l   Alarm class c                                            Action
    34  Attempted administrator privilege gain                   DROP
    33  Attempted user privilege gain                            DROP
    32  Inappropriate content was detected                       DROP
    31  Potential corporate privacy violation                    DROP
    30  Executable code was detected                             DROP
    29  Successful administrator privilege gain                  DROP
    28  Successful user privilege gain                           DROP
    27  A network trojan was detected                            DROP
    26  Unsuccessful user privilege gain                         DROP
    25  Web application attack                                   DROP
    24  Attempted denial of service                              DROP
    23  Attempted information leak                               DROP
    22  Potentially Bad Traffic                                  DROP
    21  Attempt to login by a default username and password      DROP
    20  Detection of a denial of service attack                  DROP
    ... ...                                                      ...
    Table 1: Defender stop actions in the emulation; l denotes the number of stops remaining.
  • 32. 11/28 Optimal Stopping Game I We assume a strategic attacker I Attacker knows which actions generate alarms I Attacker tries to be stealthy I Attacker may try to achieve denial of service [Figure: the attacker π2 and the clients send IP packets into the infrastructure; the defender π1 reads h_t and controls the IPS drop rules.]
  • 33–37. 12/28 Optimal Stopping Game [Figure: timeline of a game episode from t = 1 with attacker actions, defender actions, and the time at which the intrusion is prevented.] I The attacker’s stopping times τ_{2,1}, τ_{2,2}, . . . determine the times to start/stop the intrusion I During the intrusion, the attacker follows a fixed intrusion strategy I The defender’s stopping times τ_{1,1}, τ_{1,2}, . . . determine the times to update the IPS configuration I We seek a Nash equilibrium (π*_1, π*_2), from which we can extract the optimal defender strategy π*_1 against a worst-case attacker. We model this game as a zero-sum partially observed stochastic game.
  • 38–45. 13/28 Partially Observed Stochastic Game I Players: N = {1, 2} (Defender = 1) I States: intrusion s_t ∈ {0, 1}, terminal state ∅ I Observations: IDS alerts ∆x_{1,t}, ∆x_{2,t}, . . . , ∆x_{M,t}, defender stops remaining l_t ∈ {1, .., L}, observation distribution f_X(∆x_1, . . . , ∆x_M | s_t) I Actions: A_1 = A_2 = {S, C} I Rewards: defender reward reflects security and service; attacker reward is the negative of the defender reward I Transition probabilities: follow from the game dynamics I Horizon: ∞ [Figure: state machine over 0, 1 and ∅: the attacker’s stop a_t^(2) = S moves the game from 0 to 1 (t ≥ 1, l_t > 0) and from 1 to ∅ (t ≥ 2, l_t > 0); the defender’s last stop (l_t = 1, a_t^(1) = S) moves either state to ∅.]
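The state diagram on this slide can be read as a small transition function. The sketch below is an assumption-laden Python rendering of those dynamics: the state space, the action labels, and the role of l follow the slide, while the reward values and the placeholder observation model are purely illustrative and are not the R_l or f_X of the actual model.

```python
import random

def step(s, l, a1, a2):
    """One transition of the stopping game suggested by the state diagram.
    s in {0, 1, "terminal"}: 0 = no intrusion, 1 = ongoing intrusion.
    l: defender stops remaining; a1, a2 in {"S", "C"} (stop / continue).
    The returned reward is a made-up placeholder, not the model's R_l(s, a)."""
    if s == "terminal":
        return s, l, 0.0
    if a1 == "S":
        l -= 1
        if l == 0:                       # the defender's final stop ends the game
            return "terminal", 0, 1.0 if s == 1 else -1.0
    if s == 0 and a2 == "S":             # attacker starts the intrusion
        s = 1
    elif s == 1 and a2 == "S":           # attacker stops, the game ends
        return "terminal", l, -1.0
    return s, l, (-0.5 if s == 1 else 0.1)

def observe(s, M=3):
    """Placeholder for f_X(. | s): alert counts are larger during an intrusion."""
    rate = 5.0 if s == 1 else 1.0
    return [random.expovariate(1.0 / rate) for _ in range(M)]
```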
  • 46–47. 14/28 One-Sided Partial Observability I We assume that the attacker has perfect information; only the defender has partial information. [Figure: the attacker observes the state sequence s_1, s_2, . . . , s_n directly, while the defender only observes o_1, o_2, . . .] I This makes it tractable to compute the defender’s belief b_{t,π2}^(1)(s_t) = P[s_t | h_t, π_2] (it avoids nested beliefs)
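Under the one-sided information assumption, the defender's belief b_t^(1)(s_t) = P[s_t | h_t, π_2] can be maintained with a standard recursive Bayes (HMM) filter. The sketch below is not from the deck; the transition kernel it takes as input would, in the model, be induced by the known attacker strategy π_2, and all concrete numbers in the usage example are placeholders.

```python
def belief_update(b, o, transition, obs_likelihood, states=(0, 1)):
    """One step of the recursive Bayes filter behind b_t(s) = P[s_t | h_t, pi_2].
    b: dict state -> probability.  o: the latest observation o_t.
    transition[s][s_next]: P[s_next | s], induced by the known attacker strategy.
    obs_likelihood(o, s): the density f_X(o | s).  All inputs are placeholders."""
    unnormalized = {}
    for s_next in states:
        predicted = sum(b[s] * transition[s][s_next] for s in states)  # predict
        unnormalized[s_next] = obs_likelihood(o, s_next) * predicted   # correct
    z = sum(unnormalized.values())
    return {s: v / z for s, v in unnormalized.items()}

# Illustrative usage with made-up numbers:
b0 = {0: 0.9, 1: 0.1}
T = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.0, 1: 1.0}}
lik = lambda o, s: 0.7 if (o > 50) == (s == 1) else 0.3
print(belief_update(b0, o=120, transition=T, obs_likelihood=lik))
```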
  • 48–51. 15/28 Game Analysis I Defender strategy is of the form π_{1,l} : B → ∆(A_1) I Attacker strategy is of the form π_{2,l} : S × B → ∆(A_2) I Defender and attacker objectives: J_1(π_{1,l}, π_{2,l}) = E_{(π_{1,l},π_{2,l})}[ Σ_{t=1}^∞ γ^{t−1} R_l(s_t, a_t) ], J_2(π_{1,l}, π_{2,l}) = −J_1 I Best response correspondences: B_1(π_{2,l}) = arg max_{π_{1,l} ∈ Π_1} J_1(π_{1,l}, π_{2,l}), B_2(π_{1,l}) = arg max_{π_{2,l} ∈ Π_2} J_2(π_{1,l}, π_{2,l}) I Nash equilibrium (π*_{1,l}, π*_{2,l}): π*_{1,l} ∈ B_1(π*_{2,l}) and π*_{2,l} ∈ B_2(π*_{1,l}) (1)
  • 52–56. 16/28 Structure of Best Response Strategies [Figure, built up over five slides: on the belief interval b^(1) ∈ [0, 1], the defender’s stopping sets S_1^(1), S_2^(1), . . . , S_L^(1) are nested intervals with thresholds α̃_1, α̃_2, . . . , α̃_L; the attacker’s stopping sets S_{0,1}^(2), . . . , S_{0,L}^(2) and S_{1,1}^(2), . . . , S_{1,L}^(2) have thresholds β̃_{0,1}, . . . , β̃_{0,L} and β̃_{1,1}, . . . , β̃_{1,L}.]
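The multi-threshold structure sketched above suggests a very compact defender policy: with l stops remaining, stop as soon as the belief reaches the threshold α̃_l. The Python sketch below illustrates this; the threshold values are made up for the example and are not learned values from the paper.

```python
def defender_threshold_policy(b, l, alphas):
    """Multi-threshold best response suggested by the structural result (sketch).
    alphas[0] >= alphas[1] >= ... >= alphas[L-1]; alphas[l-1] = alpha_l is the
    threshold used when l stops remain. Returns 'S' (stop) or 'C' (continue)."""
    return "S" if b >= alphas[l - 1] else "C"

# Illustrative thresholds for L = 3 (not learned values):
alphas = [0.9, 0.7, 0.5]          # alpha_1 >= alpha_2 >= alpha_3
print(defender_threshold_policy(b=0.6, l=3, alphas=alphas))  # 'S'
print(defender_threshold_policy(b=0.6, l=1, alphas=alphas))  # 'C'
```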
  • 57–64. 17/28 Our Method for Finding Effective Security Strategies [Figure, shown over several slides: the target infrastructure is selectively replicated in an emulation system; policy evaluation & model estimation feed model creation & system identification; a simulation system runs reinforcement learning & generalization; the resulting policy π is mapped back and implemented in the target infrastructure (automation & self-learning systems).]
  • 65. 18/28 Emulating the Target Infrastructure I Emulate hosts with docker containers I Emulate IDS and vulnerabilities with software I Network isolation and traffic shaping through NetEm in the Linux kernel I Enforce resource constraints using cgroups I Emulate client arrivals with a Poisson process I Internal connections are full-duplex & loss-less with bit capacities of 1000 Mbit/s I External connections are full-duplex with bit capacities of 100 Mbit/s & 0.1% packet loss in normal operation and random bursts of 1% packet loss [Figure: the same network topology as on slide 2, with attacker, clients, defender, an IDS at the gateway, and numbered components]
  • 66. 19/28 Running a Game Episode in the Emulation I A distributed system with synchronized clocks I We run software sensors on all emulated hosts I Sensors produce messages to a distributed queue (Kafka) I A stream processor (Spark) consumes messages from the queue and computes statistics I Actions are selected based on the computed statistics and the strategies I Actions are sent to the emulation using gRPC I Actions are executed by running commands on the hosts [Figure: the loop o_t → b_t → (π_1, π_2) → a_t acting on the emulated state s_t]
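As an illustration of the sensor-to-strategy pipeline described on this slide, the following hedged Python sketch consumes alert messages from a Kafka topic and aggregates counts per time step. The topic name, the message schema, and the aggregation logic are assumptions; the deck only states that sensors publish to Kafka and that a stream processor computes the statistics.

```python
# Minimal sketch of the statistics step between the queue and the strategy.
# Topic name and message schema are assumptions, not from the deck.
import json
from collections import defaultdict
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "ids-alerts",                              # assumed topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

window = defaultdict(int)   # alert counts per severity class in the current step
for msg in consumer:
    alert = msg.value                           # e.g. {"priority": 1, "host": "..."}
    window[alert.get("priority", 0)] += 1
    # In the real system the stream processor would close the window at each
    # time step and hand the aggregated counts o_t to the defender strategy.
```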
  • 67. 20/28 Our Reinforcement Learning Approach I We learn a Nash equilibrium (π*_{1,l,θ^(1)}, π*_{2,l,θ^(2)}) through fictitious self-play. I In each iteration: 1. Learn a best response strategy of the defender by solving a POMDP: π̃_{1,l,θ^(1)} ∈ B_1(π_{2,l,θ^(2)}). 2. Learn a best response strategy of the attacker by solving an MDP: π̃_{2,l,θ^(2)} ∈ B_2(π_{1,l,θ^(1)}). 3. Store the best response strategies in two buffers Θ_1, Θ_2 4. Update strategies to be the average of the stored strategies [Figure: the self-play process alternates between attacker and defender strategies and converges to (π*_2, π*_1).]
  • 68–72. 21/28 Fictitious Self-Play
    Input Γ_p: the POSG
    Output (π*_{1,θ}, π*_{2,θ}): an approximate Nash equilibrium
    1: procedure ApproximateFP
    2:   θ^(1) ∼ N_L(−1, 1), θ^(2) ∼ N_{2L}(−1, 1)
    3:   Θ^(1) ← {θ^(1)}, Θ^(2) ← {θ^(2)}, δ̂ ← ∞
    4:   while δ̂ ≥ δ do
    5:     θ^(1) ← ThresholdBR(Γ_p, π_{2,l,θ}, N, a, c, λ, A, )
    6:     θ^(2) ← ThresholdBR(Γ_p, π_{1,l,θ}, N, a, c, λ, A, )
    7:     Θ^(1) ← Θ^(1) ∪ {θ^(1)}, Θ^(2) ← Θ^(2) ∪ {θ^(2)}
    8:     π_{1,l,θ} ← MixtureDistribution(Θ^(1))
    9:     π_{2,l,θ} ← MixtureDistribution(Θ^(2))
    10:    δ̂ ← Exploitability(π_{1,l,θ}, π_{2,l,θ})
    11:  end while
    12:  return (π_{1,l,θ}, π_{2,l,θ})
    13: end procedure
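The self-play loop on this slide can be summarized in a few lines of Python. The sketch below is a schematic of the outer loop only: `best_response` (the threshold-learning routine) and `exploitability` are placeholders passed in by the caller, and the average strategy is represented as a uniform mixture over the stored best responses.

```python
import random

def fictitious_self_play(best_response, exploitability, delta=0.1, max_iter=100):
    """Schematic of the outer ApproximateFP loop on this slide.
    best_response(player, opponent_strategy) -> a policy (callable);
    exploitability(pi1, pi2) -> estimated distance from equilibrium.
    Both are placeholders for the threshold-learning and evaluation routines."""
    buf1, buf2 = [], []                         # buffers Theta^(1), Theta^(2)
    pi1_avg, pi2_avg = None, None               # None = no opponent model yet

    def mixture(buf):
        # Average strategy: sample one stored best response uniformly per query.
        return lambda *args: random.choice(buf)(*args)

    for _ in range(max_iter):
        buf1.append(best_response(1, pi2_avg))  # defender BR (solve a POMDP)
        buf2.append(best_response(2, pi1_avg))  # attacker BR (solve an MDP)
        pi1_avg, pi2_avg = mixture(buf1), mixture(buf2)
        if exploitability(pi1_avg, pi2_avg) < delta:
            break
    return pi1_avg, pi2_avg
```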
  • 73–75. 22/28 Our Reinforcement Learning Algorithm for Learning Best-Response Threshold Strategies I We use the structural result that threshold best response strategies exist (Theorem 1) to design an efficient reinforcement learning algorithm to learn best response strategies. I We seek to learn: L thresholds of the defender, α̃_1 ≥ α̃_2 ≥ . . . ≥ α̃_L ∈ [0, 1], and 2L thresholds of the attacker, β̃_{0,1}, β̃_{1,1}, . . . , β̃_{0,L}, β̃_{1,L} ∈ [0, 1] I We learn these thresholds iteratively through Robbins and Monro’s stochastic approximation algorithm.¹ [Figure: a loop between Monte-Carlo simulation, which produces the performance estimate Ĵ(θ_n), and stochastic approximation (RL), which produces θ_{n+1}, starting from θ_1 and converging to θ̂*.] ¹ Herbert Robbins and Sutton Monro. “A Stochastic Approximation Method”. In: The Annals of Mathematical Statistics 22.3 (1951), pp. 400–407. doi: 10.1214/aoms/1177729586.
  • 76–79. 23/28 Our Reinforcement Learning Algorithm 1. Parameterize the strategies π_{1,l,θ^(1)}, π_{2,l,θ^(2)} by θ^(1) ∈ R^L, θ^(2) ∈ R^{2L} 2. The policy gradient ∇_{θ^(i)} J(θ^(i)) = E_{π_{i,l,θ^(i)}}[ Σ_{t=1}^∞ ∇_{θ^(i)} log π_{i,l,θ^(i)}(a_t^(i) | s_t) Σ_{τ=t}^∞ r_τ ] exists as long as π_{i,l,θ^(i)} is differentiable. 3. A pure threshold strategy is not differentiable. 4. To ensure differentiability and to constrain the thresholds to be in [0, 1], we define π_{i,θ^(i),l} to be a smooth stochastic strategy that approximates a threshold strategy: π_{i,θ^(i)}(S | b^(1)) = (1 + ((b^(1)(1 − σ(θ^(i)_j))) / (σ(θ^(i)_j)(1 − b^(1))))^(−20))^(−1), where σ(·) is the sigmoid function and σ(θ^(i)_j) is the threshold.
  • 80. 24/28 Smooth Threshold [Figure: the smooth stopping probability (1 + ((b^(1)(1 − σ(θ^(i)_j))) / (σ(θ^(i)_j)(1 − b^(1))))^(−20))^(−1) plotted against b^(1) ∈ [0, 1], compared with the step function it approximates.]
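The smoothed threshold above is easy to verify numerically. The following Python sketch (not from the deck; the exponent 20 matches the slide, while the example values of θ and b are arbitrary) evaluates the stopping probability and shows how it approximates a step function at b^(1) = σ(θ).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def smooth_stop_probability(b, theta, k=20):
    """Smoothed threshold strategy from the slide:
    pi(S | b) = (1 + ((b * (1 - sigma(theta))) / (sigma(theta) * (1 - b))) ** (-k)) ** (-1)
    sigma(theta) is the threshold; as k grows this approaches a step at b = sigma(theta).
    The endpoints b = 0 and b = 1 are handled separately to avoid division by zero."""
    thr = sigmoid(theta)
    if b <= 0.0:
        return 0.0
    if b >= 1.0:
        return 1.0
    ratio = (b * (1.0 - thr)) / (thr * (1.0 - b))
    return 1.0 / (1.0 + ratio ** (-k))

# Around the threshold sigma(theta) the stop probability rises sharply:
theta = 0.0                      # sigma(0) = 0.5
for b in (0.3, 0.45, 0.5, 0.55, 0.7):
    print(b, round(smooth_stop_probability(b, theta), 3))
```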
  • 81–85. 25/28 Our Reinforcement Learning Algorithm 1. We learn the thresholds through simulation. 2. For each iteration n ∈ {1, 2, . . .}, we perturb θ_n^(i) to obtain θ_n^(i) + c_n∆_n and θ_n^(i) − c_n∆_n. 3. Then, we run two MDP or POMDP episodes 4. We then use the obtained episode outcomes Ĵ_i(θ_n^(i) + c_n∆_n) and Ĵ_i(θ_n^(i) − c_n∆_n) to estimate ∇_{θ^(i)} J_i(θ^(i)) using the Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimator⁴: (∇̂_{θ_n^(i)} J_i(θ_n^(i)))_k = (Ĵ_i(θ_n^(i) + c_n∆_n) − Ĵ_i(θ_n^(i) − c_n∆_n)) / (2c_n(∆_n)_k) 5. Next, we use the estimated gradient and update the vector of thresholds through the stochastic approximation update: θ_{n+1}^(i) = θ_n^(i) + a_n ∇̂_{θ_n^(i)} J_i(θ_n^(i)) ⁴ James C. Spall. “Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation”. In: IEEE Transactions on Automatic Control 37.3 (1992), pp. 332–341.
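For reference, one SPSA iteration as described in steps 2–5 can be written as follows. This is a generic sketch of the Spall (1992) estimator with a Rademacher perturbation; Ĵ_i is a placeholder for the noisy episode return, and the gain sequences a_n, c_n are supplied by the caller.

```python
import random

def spsa_step(theta, J_hat, a_n, c_n):
    """One SPSA iteration: simultaneous perturbation gradient estimate + ascent.
    theta: list of threshold parameters; J_hat(theta): noisy objective estimate
    obtained from a simulated episode (placeholder here)."""
    delta = [random.choice([-1.0, 1.0]) for _ in theta]           # Delta_n
    theta_plus = [t + c_n * d for t, d in zip(theta, delta)]
    theta_minus = [t - c_n * d for t, d in zip(theta, delta)]
    j_plus, j_minus = J_hat(theta_plus), J_hat(theta_minus)       # two episodes
    grad = [(j_plus - j_minus) / (2.0 * c_n * d) for d in delta]  # gradient estimate
    return [t + a_n * g for t, g in zip(theta, grad)]             # ascent update

# Illustrative usage with a noisy quadratic objective (placeholder for an episode):
J_hat = lambda th: -sum((t - 0.5) ** 2 for t in th) + random.gauss(0, 0.01)
theta = [0.0, 0.0]
for n in range(1, 200):
    theta = spsa_step(theta, J_hat, a_n=0.1 / n, c_n=0.1 / n ** 0.25)
print([round(t, 2) for t in theta])   # moves toward the maximizer [0.5, 0.5]
```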
  • 86. 26/28 Simulation Results (Emulation Results TBD) [Figures: (1) exploitability of the learned strategy pair versus training time in minutes, approaching the Nash equilibrium for observation spaces |O| = 10², 10³, 10⁴; (2) convergence times of ApproximateFP, NFSP, and HSVI for OS-POSGs; (3) number of dropped packets (malign versus benign) for the learned equilibrium defender strategy compared with the heuristics “drop all alerts” and “drop alerts reactively”.]
  • 87. 27/28 Demo - A System for Interactive Examination of Learned Security Strategies [Figure: a client web browser talks to a JavaScript policy-examination frontend, which talks to a Flask HTTP server backed by PostgreSQL (traces, policy π_θ), the emulation, and the simulation.] Architecture of the system for examining learned security strategies.
  • 88. 28/28 Conclusions & Future Work I Conclusions: I We develop a method to automatically learn security policies: (1) emulation system; (2) system identification; (3) simulation system; (4) reinforcement learning; and (5) domain randomization and generalization I We apply the method to an intrusion prevention use case I We formulate intrusion prevention as a multiple stopping problem I We present a partially observed stochastic game of the use case I We present a POMDP model of the defender’s problem I We present an MDP model of the attacker’s problem I We apply stopping theory to establish structural results on the best response strategies I Our research plans: I Run experiments in the emulation system I Make the learned strategy available as a plugin to the Snort IDS I Extend the model: fewer restrictions on the defender I Scale up the emulation system I Non-static infrastructures