Intrusion Prevention through Optimal Stopping and Self-Play
NSE Seminar, Mar 18, 2022
Kim Hammar & Rolf Stadler (kimham@kth.se, stadler@kth.se)
Division of Network and Systems Engineering
KTH Royal Institute of Technology
Use Case: Intrusion Prevention
▶ A defender owns an infrastructure
  ▶ Consists of connected components
  ▶ Components run network services
  ▶ The defender defends the infrastructure by monitoring and active defense
  ▶ Has partial observability
▶ An attacker seeks to intrude on the infrastructure
  ▶ Has a partial view of the infrastructure
  ▶ Wants to compromise specific components
  ▶ Attacks by reconnaissance, exploitation, and pivoting
[Figure: the target infrastructure — an attacker and clients reach a gateway that runs an IDS and sends alerts to the defender; components 1–31 sit behind the gateway.]
The Intrusion Prevention Problem
[Figure: an episode unfolding over t = 1, …, 200 — three time series show the defender's belief b, the number of IDS alerts, and the number of logins.]
When to take a defensive action?
Which action to take?
A Brief History of Intrusion Prevention
Quality control milestones (optimal stopping): Shewhart's control chart (1925); Wald's test (1945); CUSUM (Page, 1951); Shiryaev's test (1963).
Reference points: ARPANET (1969); Internet (1980s).
Intrusion prevention milestones: manual detection/prevention (1980s); audit logs and manual detection/prevention (1980s); rule-based IDS/IPS (1990s); rule-based & statistical IDS/IPS (2000s); research on ML-based IDS, RL-based IPS, control-based IPS, and computational game theory for IPS (2010–present).
Our Approach
▶ Formal model:
  ▶ Controlled Hidden Markov Model
  ▶ The defender has partial observability
  ▶ A game if the attacker is active
▶ Data collection:
  ▶ Emulated infrastructure
▶ Finding defender strategies:
  ▶ Self-play reinforcement learning
  ▶ Optimal stopping
[Figure: the approach — selective replication of the target infrastructure in an emulation system; model creation & system identification; reinforcement learning, optimal stopping & a hidden Markov model to find a strategy π; strategy mapping and implementation back in the emulation system.]
Outline
▶ Use Case & Approach:
  ▶ Intrusion prevention
  ▶ Reinforcement learning and optimal stopping
▶ Formal Model of the Use Case:
  ▶ Intrusion prevention as an optimal stopping problem
  ▶ Partially observed stochastic game
▶ Game Analysis and Structure of (π̃_1, π̃_2):
  ▶ Structural result: multi-threshold best responses
  ▶ Stopping sets S_l^(1) are connected and nested
▶ Our Method:
  ▶ Emulation of the target infrastructure
  ▶ System identification
  ▶ Our reinforcement learning algorithm
▶ Results & Conclusion:
  ▶ Numerical evaluation results & demo
  ▶ Conclusion & future work
Intrusion Prevention through Optimal Stopping
▶ The system evolves in discrete time-steps.
▶ A stop action = a defensive action (e.g., reconfigure the IPS).
[Figure: at each step the stochastic system emits an observation o_t and the strategy π decides Stop or Continue — stop triggers a defensive action, continue keeps monitoring.]
▶ With l stops remaining, the (L − l + 1)th stopping time τ_l is:
  τ_l = inf{t : t > τ_{l+1}, a_t = S},  l ∈ {1, …, L},  τ_{L+1} = 0
▶ τ_l is a random variable from the sample space Ω to ℕ; it depends on h_{τ_l} = (ρ_1, a_1, o_1, …, a_{τ_l − 1}, o_{τ_l}) and is independent of a_{τ_l}, o_{τ_l + 1}, …
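To make the indexing concrete, the following minimal sketch collects the stopping times of one episode for a generic stopping rule; the rule and the observation model are illustrative assumptions, not the strategies learned later in the talk:

```python
import numpy as np

def stopping_times(policy, observations, L):
    """Collect the defender's stopping times from one episode (sketch).

    policy(o, l) -> True to stop; l is the number of stops remaining.
    Returns {l: tau_l}: tau_L is the first stop, tau_1 the last.
    """
    taus, l = {}, L
    for t, o in enumerate(observations, start=1):
        if l > 0 and policy(o, l):   # a_t = S
            taus[l] = t              # tau_l = inf{t : t > tau_{l+1}, a_t = S}
            l -= 1
    return taus

# Toy example: stop whenever the observation exceeds a fixed level.
rng = np.random.default_rng(1)
obs = rng.poisson(3.0, size=50)
print(stopping_times(lambda o, l: o >= 6, obs, L=3))
```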
Intrusion Prevention through Optimal Stopping
We consider the class of stopping times T_t = {τ_l ≤ t | τ_l > τ_{l+1}} ∈ F_t, where F_t is the natural filtration on h_t.
Given the observations o_1, o_2, …, find the optimal stopping times τ*_L, τ*_{L−1}, τ*_{L−2}, …
[Figure: an observation trace o_t over time, annotated with the successive optimal stopping times τ*_L, τ*_{L−1}, τ*_{L−2}, ….]
The Defender's Stop Actions
[Figure: the defender π_1 reads the history h_t and controls the IPS, which applies drop rules to the IP packets entering the infrastructure.]
1. Ingress traffic goes through deep packet inspection at the gateway
2. The gateway runs the Snort IDS/IPS and may drop packets
3. The defender controls the IPS configuration
4. At each stopping time, we update the IPS configuration
The Defender's Stop Actions

l    Alarm class c                                          Action
34   Attempted administrator privilege gain                 DROP
33   Attempted user privilege gain                          DROP
32   Inappropriate content was detected                     DROP
31   Potential corporate privacy violation                  DROP
30   Executable code was detected                           DROP
29   Successful administrator privilege gain                DROP
28   Successful user privilege gain                         DROP
27   A network trojan was detected                          DROP
26   Unsuccessful user privilege gain                       DROP
25   Web application attack                                 DROP
24   Attempted denial of service                            DROP
23   Attempted information leak                             DROP
22   Potentially Bad Traffic                                DROP
21   Attempt to login by a default username and password    DROP
20   Detection of a denial of service attack                DROP
…    …                                                      …

Table 1: Defender stop actions in the emulation; l denotes the number of stops remaining.
Optimal Stopping Game
▶ We assume a strategic attacker
  ▶ The attacker knows which actions generate alarms
  ▶ The attacker tries to be stealthy
  ▶ The attacker may try to achieve denial of service
[Figure: as before, but an attacker π_2 and clients now inject IP packets into the infrastructure; the defender π_1 reads h_t and controls the IPS drop rules.]
Optimal Stopping Game
[Figure: a game episode on a timeline starting at t = 1 — attacker and defender actions are marked as stopping times; the episode ends when the intrusion is prevented.]
▶ The attacker's stopping times τ_{2,1}, τ_{2,2}, … determine the times to start/stop the intrusion
▶ During the intrusion, the attacker follows a fixed intrusion strategy
▶ The defender's stopping times τ_{1,L}, τ_{1,L−1}, … determine the times to update the IPS configuration
▶ We seek a Nash equilibrium (π*_1, π*_2), from which we can extract the optimal defender strategy π*_1 against a worst-case attacker.
We model this game as a zero-sum partially observed stochastic game.
Partially Observed Stochastic Game
▶ Players: N = {1, 2} (Defender = 1)
▶ States: intrusion s_t ∈ {0, 1}, terminal ∅
▶ Observations: IDS alerts Δx_{1,t}, Δx_{2,t}, …, Δx_{M,t}; defender stops remaining l_t ∈ {1, …, L}; observation distribution f_X(Δx_1, …, Δx_M | s_t)
▶ Actions: A_1 = A_2 = {S, C}
▶ Rewards: the defender reward expresses security and service; the attacker reward is the negative of the defender reward
▶ Transition probabilities: follow from the game dynamics
▶ Horizon: ∞
[Figure: the game's state machine — the attacker's stop a_t^(2) = S moves the game from state 0 (no intrusion) to state 1 (intrusion); a second attacker stop, or the defender's final stop (a_t^(1) = S when l_t = 1), moves it to the terminal state ∅.]
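For intuition about how these ingredients fit together, here is a minimal simulation sketch of such a game; the alert rates, reward constants, and dynamics are illustrative assumptions, not the paper's identified model:

```python
import numpy as np

S, C = 1, 0  # stop / continue

class StoppingGame:
    """Minimal sketch of the optimal stopping POSG; all numeric
    parameters are illustrative assumptions."""

    def __init__(self, L=3, M=1, rng=None):
        self.L, self.M = L, M
        self.rng = rng or np.random.default_rng()
        self.reset()

    def reset(self):
        self.s, self.l, self.done = 0, self.L, False
        return self._obs()

    def _obs(self):
        # Alert counts rise while an intrusion is ongoing.
        rate = 6.0 if self.s == 1 else 1.0
        return self.rng.poisson(rate, size=self.M)

    def step(self, a1, a2):
        # Defender reward trades security against service (illustrative).
        r = 1.0 if self.s == 0 else -2.0       # service vs. intrusion cost
        if a1 == S:
            self.l -= 1
            r += 5.0 if self.s == 1 else -1.0  # timely stop vs. false alarm
        if a2 == S:
            if self.s == 0:
                self.s = 1                     # intrusion starts
            else:
                self.done = True               # intrusion ends
        if self.l == 0:
            self.done = True                   # defender's final stop
        return self._obs(), r, self.done       # attacker reward is -r
```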
One-Sided Partial Observability
▶ We assume the attacker has perfect information; only the defender has partial information.
▶ The attacker's view: the state sequence s_1, s_2, …, s_n as well as the observations o_1, o_2, …
▶ The defender's view: only the observations o_1, o_2, …
▶ This makes it tractable to compute the defender's belief b^(1)_{t,π_2}(s_t) = P[s_t | h_t, π_2] (it avoids nested beliefs)
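The belief can be computed with a standard HMM filter; a minimal sketch follows, where the transition and observation matrices are illustrative numbers rather than identified quantities:

```python
import numpy as np

def belief_update(b, o, P, Z):
    """One step of the defender's HMM filter (sketch).

    b: prior belief over states, shape (n,)
    o: index of the latest observation
    P: transition matrix under the current strategies, shape (n, n)
    Z: observation likelihoods, Z[s, o] = P[o | s]
    """
    predicted = b @ P             # predict: push belief through the dynamics
    unnorm = predicted * Z[:, o]  # correct: weight by observation likelihood
    return unnorm / unnorm.sum()  # normalize

# Example with the two intrusion states {0, 1} (illustrative numbers):
P = np.array([[0.9, 0.1], [0.0, 1.0]])  # intrusions start and, here, persist
Z = np.array([[0.8, 0.2], [0.3, 0.7]])  # high alerts likelier during intrusion
b = np.array([1.0, 0.0])
b = belief_update(b, o=1, P=P, Z=Z)
print(b)  # belief mass shifts toward the intrusion state
```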
Game Analysis
▶ A defender strategy has the form π_{1,l} : B → Δ(A_1)
▶ An attacker strategy has the form π_{2,l} : S × B → Δ(A_2)
▶ Defender and attacker objectives:
  J_1(π_{1,l}, π_{2,l}) = E_{(π_{1,l}, π_{2,l})}[ Σ_{t=1}^∞ γ^{t−1} R_l(s_t, a_t) ],  J_2(π_{1,l}, π_{2,l}) = −J_1
▶ Best response correspondences:
  B_1(π_{2,l}) = arg max_{π_{1,l} ∈ Π_1} J_1(π_{1,l}, π_{2,l})
  B_2(π_{1,l}) = arg max_{π_{2,l} ∈ Π_2} J_2(π_{1,l}, π_{2,l})
▶ Nash equilibrium (π*_{1,l}, π*_{2,l}):
  π*_{1,l} ∈ B_1(π*_{2,l}) and π*_{2,l} ∈ B_2(π*_{1,l})   (1)
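A strategy pair's distance from this equilibrium condition is its exploitability, the quantity that the self-play procedure later drives toward zero. For intuition, in a small zero-sum matrix game it can be computed exactly; a sketch:

```python
import numpy as np

def exploitability(A, x, y):
    """Zero-sum matrix game sketch: A[i, j] is the row player's payoff.
    x, y are mixed strategies. Exploitability is the sum of the players'
    best-response gains; it is zero exactly at a Nash equilibrium."""
    value = x @ A @ y
    br_row = np.max(A @ y)   # best row-player payoff against y
    br_col = np.min(x @ A)   # best column-player outcome against x
    return (br_row - value) + (value - br_col)

# Matching pennies: the uniform strategies form the unique equilibrium.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(exploitability(A, np.array([0.5, 0.5]), np.array([0.5, 0.5])))  # 0.0
print(exploitability(A, np.array([1.0, 0.0]), np.array([0.5, 0.5])))  # 1.0
```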
Structure of Best Response Strategies
[Figure: the defender's best response stops when the belief b^(1) exceeds a threshold α̃_l, so the stopping sets S_l^(1) = [α̃_l, 1] with α̃_1 ≥ α̃_2 ≥ … ≥ α̃_L are connected and nested; the attacker's stopping sets S_{0,l}^(2) and S_{1,l}^(2) are defined analogously by thresholds β̃_{0,l} and β̃_{1,l}.]
Our Method for Finding Effective Security Strategies
[Figure: the method's closed loop — the target infrastructure is selectively replicated in an emulation system; traces from the emulation drive model creation & system identification; a simulation system supports reinforcement learning & generalization; learned policies π are mapped back and implemented in the emulation; the loop combines policy evaluation & model estimation toward automation & self-learning systems.]
Emulating the Target Infrastructure
▶ Emulate hosts with Docker containers
▶ Emulate the IDS and vulnerabilities with software
▶ Network isolation and traffic shaping through NetEm in the Linux kernel
▶ Enforce resource constraints using cgroups
▶ Emulate client arrivals with a Poisson process
▶ Internal connections are full-duplex & loss-less with bit capacities of 1000 Mbit/s
▶ External connections are full-duplex with bit capacities of 100 Mbit/s, 0.1% packet loss in normal operation, and random bursts of 1% packet loss
[Figure: the emulated topology mirrors the target infrastructure — attacker, clients, and a gateway with an IDS in front of components 1–31.]
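As an illustration of the traffic-shaping step, an external link's capacity and loss could be configured with NetEm roughly as follows (a sketch; the interface name and exact invocation are assumptions, and the command needs root privileges):

```python
import subprocess

def shape_external_link(dev="eth0", rate="100mbit", loss="0.1%"):
    """Sketch: apply NetEm rate limiting and random packet loss to a
    container interface via the tc command (requires root)."""
    subprocess.run(
        ["tc", "qdisc", "replace", "dev", dev, "root", "netem",
         "rate", rate, "loss", loss],
        check=True,
    )

shape_external_link()  # 100 Mbit/s with 0.1% loss, as in normal operation
```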
Running a Game Episode in the Emulation
▶ A distributed system with synchronized clocks
▶ We run software sensors on all emulated hosts
▶ Sensors produce messages to a distributed queue (Kafka)
▶ A stream processor (Spark) consumes messages from the queue and computes statistics
▶ Actions are selected based on the computed statistics and the strategies
▶ Actions are sent to the emulation using gRPC
▶ Actions are executed by running commands on the hosts
[Figure: the episode loop — observations o_t from the emulation update the belief b_t and history h_t; the strategies π_1, π_2 select actions a_t that act on the state s_t.]
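For illustration, a consumer of the sensor queue might look as follows; the topic name and message format are hypothetical, and the real pipeline computes the statistics in Spark rather than in a plain consumer:

```python
import json
from kafka import KafkaConsumer  # kafka-python

# Hypothetical topic and message schema, for illustration only.
consumer = KafkaConsumer(
    "ids-alerts",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

window = []
for msg in consumer:
    window.append(msg.value["alert_count"])
    if len(window) == 30:          # one decision interval of 30 messages
        delta_x = sum(window)      # statistic fed to the defender strategy
        window.clear()
        print("alerts in window:", delta_x)
```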
Our Reinforcement Learning Approach
▶ We learn a Nash equilibrium (π*_{1,l,θ^(1)}, π*_{2,l,θ^(2)}) through fictitious self-play.
▶ In each iteration:
  1. Learn a best response strategy of the defender by solving a POMDP: π̃_{1,l,θ^(1)} ∈ B_1(π_{2,l,θ^(2)})
  2. Learn a best response strategy of the attacker by solving an MDP: π̃_{2,l,θ^(2)} ∈ B_2(π_{1,l,θ^(1)})
  3. Store the best response strategies in two buffers Θ_1, Θ_2
  4. Update the strategies to be the average of the stored strategies
[Figure: the self-play process — alternating best responses π_1, π_2, π′_1, π′_2, … converge to π*_1, π*_2.]
Fictitious Self-Play
Input: Γ_p, the POSG; δ, the exploitability threshold
Output: (π*_{1,θ}, π*_{2,θ}), an approximate Nash equilibrium
1: procedure ApproximateFP
2:   θ^(1) ∼ N_L(−1, 1), θ^(2) ∼ N_{2L}(−1, 1)
3:   Θ^(1) ← {θ^(1)}, Θ^(2) ← {θ^(2)}, δ̂ ← ∞
4:   while δ̂ ≥ δ do
5:     θ^(1) ← ThresholdBR(Γ_p, π_{2,l,θ}, N, a, c, λ, A, ε)
6:     θ^(2) ← ThresholdBR(Γ_p, π_{1,l,θ}, N, a, c, λ, A, ε)
7:     Θ^(1) ← Θ^(1) ∪ {θ^(1)}, Θ^(2) ← Θ^(2) ∪ {θ^(2)}
8:     π_{1,l,θ} ← MixtureDistribution(Θ^(1))
9:     π_{2,l,θ} ← MixtureDistribution(Θ^(2))
10:    δ̂ ← Exploitability(π_{1,l,θ}, π_{2,l,θ})
11:  end while
12:  return (π_{1,l,θ}, π_{2,l,θ})
13: end procedure
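A compact Python rendering of this loop might look as follows; the best-response and exploitability subroutines are assumed to be given (standing in for ThresholdBR and Exploitability, which the next slides describe):

```python
import numpy as np

def approximate_fp(threshold_br_1, threshold_br_2, exploitability,
                   theta1_init, theta2_init, delta=0.1, max_iter=50, seed=0):
    """Sketch of ApproximateFP. threshold_br_i(opponent_sampler) returns a
    best-response threshold vector against the opponent's mixture;
    exploitability estimates delta-hat. All three callables are assumptions."""
    rng = np.random.default_rng(seed)
    Theta1, Theta2 = [theta1_init], [theta2_init]

    # The average (mixture) strategy samples uniformly from its buffer.
    sample1 = lambda: Theta1[rng.integers(len(Theta1))]
    sample2 = lambda: Theta2[rng.integers(len(Theta2))]

    for _ in range(max_iter):
        Theta1.append(threshold_br_1(sample2))  # defender BR: solve a POMDP
        Theta2.append(threshold_br_2(sample1))  # attacker BR: solve an MDP
        if exploitability(sample1, sample2) < delta:
            break
    return sample1, sample2
```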
Our Reinforcement Learning Algorithm for Learning Best-Response Threshold Strategies
▶ We use the structural result that threshold best response strategies exist (Theorem 1) to design an efficient reinforcement learning algorithm for learning best responses.
▶ We seek to learn:
  ▶ L thresholds of the defender: α̃_1 ≥ α̃_2 ≥ … ≥ α̃_L ∈ [0, 1]
  ▶ 2L thresholds of the attacker: β̃_{0,1}, β̃_{1,1}, …, β̃_{0,L}, β̃_{1,L} ∈ [0, 1]
▶ We learn these thresholds iteratively through Robbins and Monro's stochastic approximation algorithm.¹
[Figure: the learning loop — Monte-Carlo simulation produces a performance estimate Ĵ(θ_n); stochastic approximation (RL) updates θ_n to θ_{n+1}, converging from θ_1 toward θ̂*.]
¹ Herbert Robbins and Sutton Monro. "A Stochastic Approximation Method". In: The Annals of Mathematical Statistics 22.3 (1951), pp. 400–407. doi: 10.1214/aoms/1177729586.
Our Reinforcement Learning Algorithm
1. Parameterize the strategies π_{1,l,θ^(1)}, π_{2,l,θ^(2)} by θ^(1) ∈ R^L, θ^(2) ∈ R^{2L}
2. The policy gradient
   ∇_{θ^(i)} J(θ^(i)) = E_{π_{i,l,θ^(i)}}[ Σ_{t=1}^∞ ∇_{θ^(i)} log π_{i,l,θ^(i)}(a_t^(i) | s_t) Σ_{τ=t}^∞ r_τ ]
   exists as long as π_{i,l,θ^(i)} is differentiable.
3. A pure threshold strategy is not differentiable.
4. To ensure differentiability and to constrain the thresholds to be in [0, 1], we define π_{i,l,θ^(i)} to be a smooth stochastic strategy that approximates a threshold strategy:
   π_{i,θ^(i)}(S | b(1)) = (1 + ( b(1)(1 − σ(θ^(i),j)) / ( σ(θ^(i),j)(1 − b(1)) ) )^(−20))^(−1)
   where σ(·) is the sigmoid function and σ(θ^(i),j) is the threshold.
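The smooth threshold strategy is straightforward to implement; a minimal sketch (the exponent 20 is the value in the formula above, and larger values sharpen the step):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smooth_threshold_policy(b, theta, k=20):
    """Stopping probability given belief b in (0, 1); sigma(theta) is the
    threshold. As k grows, the policy approaches a step at the threshold."""
    thr = sigmoid(theta)
    ratio = (b * (1 - thr)) / (thr * (1 - b))
    return 1.0 / (1.0 + ratio ** (-k))

# The stopping probability crosses 0.5 exactly at b = sigma(theta):
theta = 0.0  # threshold sigma(0) = 0.5
for b in (0.3, 0.5, 0.7):
    print(b, smooth_threshold_policy(b, theta))  # ~0, 0.5, ~1
```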
Our Reinforcement Learning Algorithm
1. We learn the thresholds through simulation.
2. For each iteration n ∈ {1, 2, …}, we perturb θ_n^(i) to obtain θ_n^(i) + c_n Δ_n and θ_n^(i) − c_n Δ_n.
3. Then, we run two MDP or POMDP episodes.
4. We then use the obtained episode outcomes Ĵ_i(θ_n^(i) + c_n Δ_n) and Ĵ_i(θ_n^(i) − c_n Δ_n) to estimate ∇_{θ^(i)} J_i(θ^(i)) using the Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimator²:
   (∇̂_{θ_n^(i)} J_i(θ_n^(i)))_k = ( Ĵ_i(θ_n^(i) + c_n Δ_n) − Ĵ_i(θ_n^(i) − c_n Δ_n) ) / ( 2 c_n (Δ_n)_k )
5. Next, we use the estimated gradient and update the vector of thresholds through the stochastic approximation update:
   θ_{n+1}^(i) = θ_n^(i) + a_n ∇̂_{θ_n^(i)} J_i(θ_n^(i))
² James C. Spall. "Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation". In: IEEE Transactions on Automatic Control 37.3 (1992), pp. 332–341.
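A self-contained SPSA sketch on a toy objective; the gain-sequence exponents follow Spall's standard recommendations, while all other constants and the toy objective are illustrative:

```python
import numpy as np

def spsa_estimate(J, theta, c_n, rng):
    """One SPSA gradient estimate: two evaluations of the (noisy) objective
    under a simultaneous Rademacher perturbation of all coordinates."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    j_plus = J(theta + c_n * delta)
    j_minus = J(theta - c_n * delta)
    return (j_plus - j_minus) / (2.0 * c_n * delta)

def spsa(J, theta, n_iter=500, a=0.1, c=0.1, A=50, alpha=0.602, gamma=0.101,
         seed=0):
    """Robbins-Monro ascent with SPSA gradients; gain sequences a_n, c_n
    as recommended by Spall (1992). Hyperparameters are illustrative."""
    rng = np.random.default_rng(seed)
    for n in range(1, n_iter + 1):
        a_n = a / (n + A) ** alpha
        c_n = c / n ** gamma
        theta = theta + a_n * spsa_estimate(J, theta, c_n, rng)
    return theta

# Toy usage: maximize a concave objective with noisy evaluations.
J = lambda th: -np.sum((th - 0.3) ** 2) + 0.01 * np.random.randn()
print(spsa(J, theta=np.zeros(3)))  # approaches [0.3, 0.3, 0.3]
```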
Simulation Results (Emulation Results TBD)
[Figure: three panels — exploitability of the learned strategy pair over training time (minutes), compared against the Nash equilibrium; convergence times of ApproximateFP, NFSP, and HSVI for OS-POSGs for |O| = 10², 10³, 10⁴; and the number of dropped packets (malign vs. benign) for the learned equilibrium defender strategy and two heuristics (drop all alerts; drop alerts reactively).]
Demo: A System for Interactive Examination of Learned Security Strategies
[Figure: architecture of the system for examining learned security strategies — a JavaScript frontend in the client web browser talks to a Flask HTTP server backed by PostgreSQL, which stores traces and the policy π_θ collected from the emulation and simulation systems.]
Conclusions & Future Work
▶ Conclusions:
  ▶ We develop a method to automatically learn security policies: (1) emulation system; (2) system identification; (3) simulation system; (4) reinforcement learning; and (5) domain randomization and generalization.
  ▶ We apply the method to an intrusion prevention use case:
    ▶ We formulate intrusion prevention as a multiple stopping problem
    ▶ We present a partially observed stochastic game of the use case
    ▶ We present a POMDP model of the defender's problem
    ▶ We present an MDP model of the attacker's problem
    ▶ We apply stopping theory to establish structural results for the best response strategies
▶ Our research plans:
  ▶ Run experiments in the emulation system
  ▶ Make the learned strategy available as a plugin to the Snort IDS
  ▶ Extend the model: fewer restrictions on the defender; non-static infrastructures
  ▶ Scale up the emulation system