We study automated intrusion prevention using reinforcement learning. In a novel approach, we formulate the problem of intrusion prevention as an optimal stopping problem. This formulation gives us insight into the structure of the optimal policies, which turn out to be threshold based. Since computing the optimal defender policy with dynamic programming is not feasible for practical cases, we approximate the optimal policy through reinforcement learning in a simulation environment. To define the dynamics of the simulation, we emulate the target infrastructure and collect measurements. Our evaluation shows that the learned policies are close to optimal and that they can indeed be expressed using thresholds.
Learning Intrusion Prevention Policies Through Optimal Stopping [1]

CNSM 2021 | International Conference on Network and Service Management

Kim Hammar & Rolf Stadler
kimham@kth.se & stadler@kth.se
Division of Network and Systems Engineering
KTH Royal Institute of Technology
Oct 28, 2021

[1] Kim Hammar and Rolf Stadler. Learning Intrusion Prevention Policies through Optimal Stopping. 2021. arXiv: 2106.07160 [cs.AI].
Approach: Game Model & Reinforcement Learning

- Challenges:
  - Evolving & automated attacks
  - Complex infrastructures
- Our Goal:
  - Automate security tasks
  - Adapt to changing attack methods
- Our Approach:
  - Formal models of network attack and defense
  - Use reinforcement learning to learn policies
  - Incorporate learned policies in self-learning systems

[Figure: target infrastructure with an attacker, a client population, and a defender that monitors IDS alerts at the gateway of a network of 31 components.]
Use Case: Intrusion Prevention

- A Defender owns an infrastructure
  - Consists of connected components
  - Components run network services
  - Defender defends the infrastructure by monitoring and active defense
- An Attacker seeks to intrude on the infrastructure
  - Has a partial view of the infrastructure
  - Wants to compromise specific components
  - Attacks by reconnaissance, exploitation and pivoting

[Figure: the same infrastructure diagram as above.]

We formulate this use case as an optimal stopping problem.
Background on Optimal Stopping Problems

- The General Problem:
  - A Markov process (s_t)_{t=1}^T is observed sequentially
  - Two options per time-step t: (i) continue to observe; or (ii) stop
  - Find the optimal stopping time τ*:

    τ* = arg max_τ E_τ[ Σ_{t=1}^{τ−1} γ^{t−1} R^C_{s_t s_{t+1}} + γ^{τ−1} R^S_{s_τ s_τ} ]   (1)

    where R^S_{s s'} and R^C_{s s'} are the stop and continue rewards
- History:
  - Studied in the 18th century to analyze a gambler's fortune
  - Formalized by Abraham Wald in 1947 [2]
  - Since then it has been generalized and developed by Chow [3], Shiryaev & Kolmogorov [4], Bather [5], Bertsekas [6], etc.
- Applications & Use Cases:
  - Change detection [7], machine replacement, hypothesis testing, gambling, selling decisions [8], queue management [9], advertisement scheduling [10], the secretary problem, intrusion prevention [1], etc.

[2] Abraham Wald. Sequential Analysis. Wiley and Sons, New York, 1947.
[3] Y. Chow, H. Robbins, and D. Siegmund. Great Expectations: The Theory of Optimal Stopping. 1971.
[4] Albert N. Shiryaev. Optimal Stopping Rules. Springer-Verlag Berlin, 2007 (reprint of the 1969 Russian edition).
[5] John Bather. Decision Theory: An Introduction to Dynamic Programming and Sequential Decisions. John Wiley and Sons, 2000. ISBN: 0471976490.
[6] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. 3rd ed. Vol. I. Belmont, MA: Athena Scientific.
[7] Alexander G. Tartakovsky et al. "Detection of intrusions in information systems by sequential change-point methods". Statistical Methodology 3.3 (2006). doi: 10.1016/j.stamet.2005.05.003.
[8] Jacques du Toit and Goran Peskir. "Selling a stock at the ultimate maximum". The Annals of Applied Probability 19.3 (2009). doi: 10.1214/08-AAP566.
[9] Arghyadip Roy et al. "Online Reinforcement Learning of Optimal Threshold Policies for Markov Decision Processes". CoRR (2019). arXiv: 1912.10325.
[10] Vikram Krishnamurthy, Anup Aprem, and Sujay Bhatt. "Multiple stopping time POMDPs: Structural results & application in interactive advertising on social media". Automatica 95 (2018), pp. 385-398. doi: 10.1016/j.automatica.2018.06.013.
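To make the dynamic-programming view concrete, here is a minimal Python sketch that solves a small, fully observed finite-horizon stopping problem of the form (1) by backward induction. The transition matrix P, the rewards R_C and R_S, and all numbers are toy assumptions for illustration; the defender's problem later in the talk is partially observed and is therefore solved approximately with reinforcement learning.

```python
import numpy as np

def backward_induction(P, R_C, R_S, T, gamma=1.0):
    """Solve stopping problem (1) by dynamic programming.
    P[s, s']  : transition probabilities of the Markov process
    R_C[s, s']: reward for continuing in s and moving to s'
    R_S[s]    : reward for stopping in s
    Returns the value at t=1 and, per time-step, the states where
    stopping is optimal."""
    V = R_S.copy()                         # at t = T the process must stop
    stop_sets = [np.arange(len(R_S))]
    for t in range(T - 1, 0, -1):          # backwards from T-1 down to 1
        cont = (P * (R_C + gamma * V[None, :])).sum(axis=1)
        stop = R_S >= cont                 # stop iff it beats continuing
        V = np.where(stop, R_S, cont)
        stop_sets.append(np.flatnonzero(stop))
    return V, stop_sets[::-1]              # stop_sets[t-1]: stop states at time t

# Toy two-state example ("no intrusion" vs "intrusion"):
P   = np.array([[0.8, 0.2], [0.0, 1.0]])
R_C = np.array([[1.0, -2.0], [-2.0, -2.0]])
R_S = np.array([-5.0, 10.0])               # stopping early is costly
V, stop_sets = backward_induction(P, R_C, R_S, T=20)
print(V, stop_sets[0])
```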
Formulating Intrusion Prevention as a Stopping Problem

[Figure: episode timeline from t = 1 to t = T; an intrusion event occurs at an unknown time, after which the intrusion is ongoing. Stops before the intrusion event are early stopping times; stops after it can affect the intrusion.]

- Intrusion Prevention as an Optimal Stopping Problem:
  - The system evolves in discrete time-steps.
  - The defender observes the infrastructure (IDS, log files, etc.).
  - An intrusion occurs at an unknown time.
  - The defender can make L stops.
  - Each stop is associated with a defensive action.
  - The final stop shuts down the infrastructure.
  - Based on the observations, when is it optimal to stop?
- We formalize this problem with a POMDP (the timeline is sketched below).
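As a toy illustration of the timeline above, the following sketch samples an intrusion start time I_t ~ Ge(0.2) and classifies a given set of stop times into early stops and stops that can affect the intrusion (the function and variable names are ours, for illustration only).

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def classify_stops(stop_times, p=0.2):
    """Sample the intrusion start I ~ Ge(p) and split the defender's
    stop times into early stops (before the intrusion) and stops that
    can affect the ongoing intrusion."""
    I = rng.geometric(p)                    # support {1, 2, ...}
    early = [t for t in stop_times if t < I]
    effective = [t for t in stop_times if t >= I]
    return I, early, effective

I, early, effective = classify_stops(stop_times=[3, 9, 15])
print(f"intrusion starts at t={I}; early stops: {early}; effective stops: {effective}")
```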
A Partially Observed MDP Model for the Defender

- States:
  - Intrusion state s_t ∈ {0, 1}; terminal state ∅.
- Observations:
  - Severe/warning IDS alerts (∆x, ∆y) and login attempts ∆z, distributed according to f_XYZ(∆x, ∆y, ∆z | s_t, I_t, t).
- Actions:
  - "Stop" (S) and "Continue" (C).
- Rewards:
  - Reward for security and service; penalty for false alarms and intrusions.
- Transition probabilities:
  - A Bernoulli process (Q_t)_{t=1}^T ~ Ber(p) defines the intrusion start time I_t ~ Ge(p).
- Objective and Horizon:
  - max E_{π_θ}[ Σ_{t=1}^{T_∅} r(s_t, a_t) ], with horizon T_∅.

[Figure: state transition diagram; the state moves from 0 to 1 when the intrusion starts (Q_t = 1), and to ∅ on a stop or when the intrusion succeeds (t ≥ I_t + T_int). A second panel shows the CDF of the intrusion start time I_t ~ Ge(p = 0.2).]

We analyze the optimal policy using optimal stopping theory.
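Before turning to the analysis, the sketch below simulates one step of this POMDP. The observation model is a stand-in (Poisson alert counts with made-up rates), not the distribution f_XYZ estimated from the emulation, and the rewards are placeholder values; only the {0, 1, ∅} state structure and the Bernoulli intrusion start follow the slide.

```python
import numpy as np

rng = np.random.default_rng(7)

def step(s, a, p=0.2):
    """One transition of the defender POMDP sketch.
    s: intrusion state (0 or 1); a: 'S' (stop) or 'C' (continue)."""
    if a == "S":
        return None, None, 0.0          # final stop -> terminal state ∅
    if s == 0 and rng.random() < p:     # Bernoulli process Q_t = 1 starts the intrusion
        s = 1
    # Severe alerts Δx, warning alerts Δy, login attempts Δz (toy rates):
    rates = (60, 30, 8) if s == 1 else (20, 10, 1)
    o = tuple(rng.poisson(r) for r in rates)
    r = 1.0 if s == 0 else -2.0         # service reward vs intrusion penalty
    return s, o, r

s, t = 0, 1
while s is not None:                    # run one episode, stopping at t = 10
    s, o, r = step(s, a="C" if t < 10 else "S")
    t += 1
```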
Belief States and Stopping Set

- Belief States:
  - The belief state b_t ∈ B is defined as b_t(s_t) = P[s_t | h_t].
  - b_t is a sufficient statistic of s_t based on the history h_t = (ρ_1, a_1, o_1, ..., a_{t−1}, o_t) ∈ H.
  - B is the unit (|S| − 1)-simplex.

[Figure: B(2), the 1-dimensional unit simplex, with example point (0.4, 0.6); B(3), the 2-dimensional unit simplex, with example point (0.25, 0.55, 0.2).]

- Characterizing the Optimal Policy π*:
  - To characterize the optimal policy π*, we partition B based on the optimal actions.
  - Since s_t ∈ {0, 1}, b_t has two components: b_t(0) = P[s_t = 0 | h_t] and b_t(1) = P[s_t = 1 | h_t].
  - Since b_t(0) + b_t(1) = 1, b_t is completely characterized by b_t(1) (b_t(0) = 1 − b_t(1)).
  - Hence, B is the unit interval [0, 1].
  - Stopping set: S = {b(1) ∈ [0, 1] : π*(b(1)) = S}.
  - Continue set: C = {b(1) ∈ [0, 1] : π*(b(1)) = C}.
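In this two-state model the belief b_t(1) can be tracked with a simple Bayes filter, sketched below. The likelihood f(o, s) is a placeholder for the estimated observation distribution f_XYZ, and since state 1 is absorbing until the episode ends, the prediction step only adds probability mass p for the intrusion starting.

```python
def belief_update(b1, o, p, f):
    """One Bayes-filter update of b(1) = P[s = 1 | h] after observing o
    under the continue action.
    b1: current belief; p: per-step intrusion start probability;
    f(o, s): observation likelihood of o in state s (placeholder)."""
    prior1 = b1 + (1.0 - b1) * p                 # predict: intrusion may start
    num = f(o, 1) * prior1                       # correct: weight by likelihoods
    return num / (num + f(o, 0) * (1.0 - prior1))
```

Starting from b_1(1) = 0 and applying this update once per observation yields the belief trajectory on which the stopping decision is based.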
Threshold Properties of the Optimal Defender Policy

[Figure: the optimal policy π*(b(1)) over the belief space B = [0, 1]: a threshold α* splits B into the continue set C = [0, α*), where the action is continue, and the stopping set S = [α*, 1], where the action is stop.]
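Given the threshold α*, the policy itself reduces to one comparison; a minimal sketch (α* = 0.75 is an arbitrary example value, not a result from the paper):

```python
def threshold_policy(b1, alpha_star=0.75):
    """Threshold structure of the optimal policy: stop iff b(1) >= α*."""
    return "S" if b1 >= alpha_star else "C"

assert threshold_policy(0.9) == "S"   # belief in the stopping set
assert threshold_policy(0.3) == "C"   # belief in the continue set
```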
Our Method for Finding Effective Security Strategies

[Figure: closed-loop pipeline. The real-world infrastructure is replicated by an emulation system; model estimation on emulation measurements yields a POMDP model; the POMDP model drives a simulation system in which policies are learned with reinforcement learning; the learned policies feed self-learning systems that automate the real-world infrastructure.]
Emulation System: Configuration Space Σ

[Figure: emulated infrastructures instantiated from configurations σ_i, e.g. with subnets 172.18.4.0/24, 172.18.19.0/24 and 172.18.61.0/24.]

Emulation: a cluster of machines that runs a virtualized infrastructure which replicates important functionality of target systems.
- The set of virtualized configurations defines a configuration space Σ = ⟨A, O, S, U, T, V⟩.
- A specific emulation is based on a configuration σ_i ∈ Σ.
From Emulation to Simulation: System Identification

[Figure: the emulated network (gateway R1 and machines m_1, ..., m_7 with feature vectors m_{i,1}, ..., m_{i,k}) is abstracted into a model that defines the POMDP ⟨S, A, P, R, γ, O, Z⟩ with state, action and observation sequences (s_t), (a_t), (o_t).]

- Abstract Model Based on Domain Knowledge: models the set of controls, the objective function, and the features of the emulated network.
  - Defines the static parts of the POMDP model.
- Dynamics Model (P, Z) Identified using System Identification: an algorithm based on random walks and maximum-likelihood estimation:

  M̂(b' | b, a) ≜ n(b, a, b') / Σ_{j'} n(b, a, j')

  where n(b, a, b') counts the observed transitions from b to b' under action a.
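A minimal sketch of the count-based maximum-likelihood step: given transition triples (b, a, b') collected from random walks in the emulation, the estimate is the normalized count, as in the formula above (the data layout is our assumption).

```python
from collections import Counter

def estimate_dynamics(transitions):
    """MLE of the dynamics: M̂(b'|b,a) = n(b,a,b') / Σ_{j'} n(b,a,j').
    transitions: iterable of (b, a, b_next) triples from the emulation."""
    transitions = list(transitions)
    n = Counter(transitions)                        # n(b, a, b')
    totals = Counter((b, a) for b, a, _ in transitions)
    return {(b, a, b2): c / totals[(b, a)] for (b, a, b2), c in n.items()}

M = estimate_dynamics([(0, "C", 0), (0, "C", 0), (0, "C", 1), (1, "C", 1)])
assert abs(M[(0, "C", 1)] - 1 / 3) < 1e-9
```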
System Identification: Estimated Empirical Distributions

[Figure: estimated empirical distributions f̂_XYZ of the number of severe IDS alerts ∆x, of warning IDS alerts ∆y, and of login attempts ∆z, measured against Novice, Experienced and Expert attackers and under no intrusion.]
Policy Optimization in the Simulation System using Reinforcement Learning

- Goal:
  - Approximate π* = arg max_π E[ Σ_{t=1}^T γ^{t−1} r_{t+1} ].
- Learning Algorithm:
  - Represent π by a parameterized policy π_θ.
  - Define the objective J(θ) = E_{π_θ}[ Σ_{t=1}^T γ^{t−1} r(s_t, a_t) ].
  - Maximize J(θ) by stochastic gradient ascent using the actor-critic policy gradient

    ∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(a | h) · A^{π_θ}(h, a) ]

    where ∇_θ log π_θ(a | h) is the actor term and the advantage A^{π_θ}(h, a) is the critic term.
- Method (see the sketch below):
  1. Simulate a series of POMDP episodes.
  2. Use the episode outcomes and trajectories to estimate ∇_θ J(θ).
  3. Update the policy π_θ with stochastic gradient ascent.
  4. Continue until convergence.

[Figure: agent-environment loop; the agent takes action a_t, and the environment returns the next state s_{t+1} and reward r_{t+1}.]
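A simplified sketch of the policy-gradient step: REINFORCE with an average-return baseline standing in for the critic (the evaluation in the paper uses a PPO-style actor-critic, so this illustrates the gradient estimate, not the authors' exact algorithm). Each episode is assumed to be a list of (log_prob, reward) pairs collected by sampling actions from π_θ.

```python
import torch

def policy_gradient_update(optimizer, episodes, gamma=0.99):
    """One stochastic-gradient-ascent step on J(θ).
    episodes: list of episodes; each episode is a list of
    (log_prob, reward) pairs, where log_prob = log π_θ(a|h) with grad."""
    all_returns, losses = [], []
    for ep in episodes:                        # discounted returns, backwards
        G, rets = 0.0, []
        for _, r in reversed(ep):
            G = r + gamma * G
            rets.append(G)
        all_returns.extend(rets[::-1])
    baseline = sum(all_returns) / len(all_returns)   # crude critic stand-in
    i = 0
    for ep in episodes:
        for log_prob, _ in ep:
            # minimizing -log π_θ(a|h) * advantage == ascending J(θ)
            losses.append(-log_prob * (all_returns[i] - baseline))
            i += 1
    optimizer.zero_grad()
    torch.stack(losses).sum().backward()
    optimizer.step()
```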
Emulating the Client Population

Client | Functions             | Application servers
1      | HTTP, SSH, SNMP, ICMP | N2, N3, N10, N12
2      | IRC, PostgreSQL, SNMP | N31, N13, N14, N15, N16
3      | FTP, DNS, Telnet      | N10, N22, N4

Table 1: Emulated client population; each client interacts with application servers using a set of functions at short intervals.
Emulating the Defender's Actions

Action | Command in the Emulation
Stop   | iptables -A INPUT -i eth0 -j DROP

Table 2: Command used to implement the defender's stop action.
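For illustration, this is how an emulation driver might invoke that command from Python (a hedged sketch: it requires root privileges, and the authors' emulator may integrate the action differently).

```python
import subprocess

def defender_stop(interface="eth0"):
    """Apply the stop action: drop all incoming traffic on the interface."""
    subprocess.run(
        ["iptables", "-A", "INPUT", "-i", interface, "-j", "DROP"],
        check=True,  # raise if iptables fails (e.g. not run as root)
    )
```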
Emulating the Attacker's Actions

Time-steps t          | Actions
1 – I_t ~ Ge(0.2)     | (Intrusion has not started)
I_t+1 – I_t+7         | Recon; brute-force attacks (SSH, Telnet, FTP) on N2, N4, N10; login(N2, N4, N10); backdoor(N2, N4, N10); Recon
I_t+8 – I_t+11        | CVE-2014-6271 on N17; SSH brute-force attack on N12; login(N17, N12); backdoor(N17, N12)
I_t+12 – I_t+16       | CVE-2010-0426 exploit on N12; Recon; SQL injection on N18; login(N18); backdoor(N18)
I_t+17 – I_t+22       | Recon; CVE-2015-1427 on N25; login(N25); Recon; CVE-2017-7494 exploit on N27; login(N27)

Table 3: Attacker actions used to emulate an intrusion.
Threshold Properties of the Learned Policies

[Figure: learned stopping probabilities against a StealthyAttacker and a NoisyAttacker: π_θ(stop | x + y) as a function of the total number of IDS alerts x + y; π_θ(stop | z) as a function of the number of login attempts z; and π_θ(stop | x, y) over severe alerts x and warning alerts y. The policies exhibit soft thresholds in the alert counts, while the stopping probability stays near zero when conditioned on login attempts alone (never stop).]
Conclusions and Future Work

- Conclusions:
  - We develop a method to learn intrusion prevention policies: (1) emulation system; (2) system identification; (3) simulation system; (4) reinforcement learning; and (5) domain randomization and generalization.
  - We formulate intrusion prevention as an optimal stopping problem.
  - We present a POMDP model of the use case.
  - We apply optimal stopping theory to establish structural results on the optimal policy.
- Our research plans:
  - Extending the theoretical model:
    - Relaxing simplifying assumptions (e.g. more dynamic defender actions)
    - Active attacker
    - Multiple stops
  - Evaluation on real-world infrastructures