1/23
Learning Intrusion Prevention Policies
Through Optimal Stopping [1]
CNSM 2021 | International Conference on Network and Service
Management
Kim Hammar & Rolf Stadler
kimham@kth.se & stadler@kth.se
Division of Network and Systems Engineering
KTH Royal Institute of Technology
Oct 28, 2021
[1] Kim Hammar and Rolf Stadler. Learning Intrusion Prevention Policies through Optimal Stopping. 2021. arXiv: 2106.07160 [cs.AI].
2/23
Approach: Game Model & Reinforcement Learning
I Challenges:
I Evolving & automated attacks
I Complex infrastructures
I Our Goal:
I Automate security tasks
I Adapt to changing attack methods
I Our Approach:
I Formal models of network attack
and defense
I Use reinforcement learning to learn
policies.
I Incorporate learned policies in
self-learning systems.
[Figure: IT infrastructure with an attacker, clients, and a defender; the gateway runs an IDS that raises alerts; 31 networked components (1-31)]
4/23
Use Case: Intrusion Prevention
I A Defender owns an infrastructure
I Consists of connected components
I Components run network services
I Defender defends the infrastructure
by monitoring and active defense
I An Attacker seeks to intrude on the
infrastructure
I Has a partial view of the
infrastructure
I Wants to compromise specific
components
I Attacks by reconnaissance,
exploitation and pivoting
We formulate this use case as an Optimal Stopping problem
5/23
Background on Optimal Stopping Problems
I The General Problem:
I A Markov process (st)_{t=1}^{T} is observed sequentially
I Two options per t: (i) continue to observe; or (ii) stop
I Find the optimal stopping time τ∗:

τ∗ = arg max_τ E_τ[ Σ_{t=1}^{τ−1} γ^{t−1} R^C_{s_t s_{t+1}} + γ^{τ−1} R^S_{s_τ s_τ} ]    (1)

where R^S_{ss′} and R^C_{ss′} are the stop/continue rewards (a small dynamic-programming sketch follows below)
I History:
I Studied in the 18th century to analyze a gambler's fortune
I Formalized by Abraham Wald in 1947 [2]
I Since then it has been generalized and developed further (Chow [3], Shiryaev & Kolmogorov [4], Bather [5], Bertsekas [6], etc.)
I Applications & Use Cases:
I Change detection [7], machine replacement, hypothesis testing, gambling, selling decisions [8], queue management [9], advertisement scheduling [10], the secretary problem, intrusion prevention [1], etc.

[2] Abraham Wald. Sequential Analysis. Wiley and Sons, New York, 1947.
[3] Y. Chow, H. Robbins, and D. Siegmund. "Great Expectations: The Theory of Optimal Stopping". 1971.
[4] Albert N. Shiryaev. Optimal Stopping Rules. Reprint of the Russian edition from 1969. Springer-Verlag Berlin, 2007.
[5] John Bather. Decision Theory: An Introduction to Dynamic Programming and Sequential Decisions. John Wiley and Sons, 2000. ISBN: 0471976490.
[6] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. 3rd ed. Vol. I. Belmont, MA, USA: Athena Scientific.
[7] Alexander G. Tartakovsky et al. "Detection of intrusions in information systems by sequential change-point methods". Statistical Methodology 3.3 (2006).
[8] Jacques du Toit and Goran Peskir. "Selling a stock at the ultimate maximum". The Annals of Applied Probability 19.3 (2009).
[9] Arghyadip Roy et al. "Online Reinforcement Learning of Optimal Threshold Policies for Markov Decision Processes". CoRR (2019). arXiv: 1912.10325.
[10] Vikram Krishnamurthy, Anup Aprem, and Sujay Bhatt. "Multiple stopping time POMDPs: Structural results & application in interactive advertising on social media". Automatica 95 (2018), pp. 385-398.
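
The maximization in Eq. (1) can be solved by backward induction. Below is a minimal, illustrative sketch (not the authors' implementation): it computes the stop/continue decision for a toy two-state, fully observed Markov chain; the horizon, discount factor, transition matrix, and rewards are assumed example values.

```python
import numpy as np

# Backward induction for a finite-horizon stopping problem in the form of Eq. (1).
# All numerical values below are assumed toy values, not taken from the paper.
T = 10                                   # horizon
gamma = 0.95                             # discount factor
P = np.array([[0.8, 0.2],                # P[s' | s] for the observed Markov chain
              [0.0, 1.0]])
R_stop = np.array([-10.0, 50.0])         # R^S_{s s}: reward for stopping in state s
R_cont = np.array([[1.0, -2.0],          # R^C_{s s'}: reward for continuing from s to s'
                   [-2.0, -5.0]])

V = R_stop.copy()                        # at t = T the process must stop
for t in range(T - 1, 0, -1):
    Q_cont = (P * (R_cont + gamma * V)).sum(axis=1)   # expected value of continuing
    V = np.maximum(R_stop, Q_cont)                    # Bellman backup: stop vs continue

# Optimal action per state at t = 1: stop where stopping beats continuing.
print(["stop" if R_stop[s] >= Q_cont[s] else "continue" for s in range(2)])
```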
6/23
Formulating Intrusion Prevention as a Stopping Problem
[Timeline: an episode runs from time-step t = 1 to t = T; the intrusion event occurs at an unknown time and the intrusion is ongoing thereafter; stops before the event are early stopping times, stops after it are stopping times that affect the intrusion]
I Intrusion Prevention as Optimal Stopping Problem:
I The system evolves in discrete time-steps.
I Defender observes the infrastructure (IDS, log files, etc.).
I An intrusion occurs at an unknown time.
I The defender can make L stops.
I Each stop is associated with a defensive action
I The final stop shuts down the infrastructure.
I Based on the observations, when is it optimal to stop?
I We formalize this problem with a POMDP
7/23
A Partially Observed MDP Model for the Defender
I States:
I Intrusion state st ∈ {0, 1}, terminal state ∅.
I Observations:
I Severe/Warning IDS alerts (∆x, ∆y) and login attempts ∆z, distributed as fXYZ(∆x, ∆y, ∆z | st, It, t)
I Actions:
I “Stop” (S) and “Continue” (C)
I Rewards:
I Reward: security and service. Penalty: false alarms and intrusions
I Transition probabilities:
I A Bernoulli process (Qt)_{t=1}^{T} ∼ Ber(p) defines the intrusion start time It ∼ Ge(p)
I Objective and Horizon:
I max E_{πθ}[ Σ_{t=1}^{T∅} r(st, at) ], where the horizon is the terminal time T∅
[Figure: state transition diagram over states 0, 1, and ∅; the intrusion starts when Qt = 1, a stop leads to the terminal state ∅, and the intrusion succeeds when t ≥ It + Tint]
[Figure: CDF of the intrusion start time It ∼ Ge(p = 0.2)]
We analyze the optimal policy using optimal stopping theory
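
To make the POMDP dynamics concrete, here is a small, illustrative episode generator (a sketch, not the authors' code): the intrusion start time is drawn as It ∼ Ge(p), while the observation model is a simple placeholder; the paper instead estimates fXYZ from measurements of the emulation system.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_episode(p=0.2, max_len=50):
    """Sample one episode of the intrusion POMDP (placeholder observation model)."""
    intrusion_start = rng.geometric(p)        # I_t ~ Ge(p), defined by the Bernoulli process Q_t
    trajectory = []
    for t in range(1, max_len + 1):
        s = 1 if t >= intrusion_start else 0  # intrusion state s_t in {0, 1}
        # Placeholder observation model: alert/login counts increase during an intrusion.
        dx = rng.poisson(5 + 50 * s)          # severe IDS alerts  (Delta x)
        dy = rng.poisson(10 + 30 * s)         # warning IDS alerts (Delta y)
        dz = rng.poisson(1 + 5 * s)           # login attempts     (Delta z)
        trajectory.append((t, s, (dx, dy, dz)))
    return intrusion_start, trajectory

I_t, traj = sample_episode()
print(f"intrusion starts at t = {I_t}; first observation: {traj[0]}")
```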
8/23
Belief States and Stopping Set
I Belief States:
I The belief state bt ∈ B is defined as bt(st) = P[st | ht]
I bt is a sufficient statistic of st based on ht = (ρ1, a1, o1, . . . , at−1, ot) ∈ H
I B is the unit (|S| − 1)-simplex
I Characterizing the Optimal Policy π∗:
I To characterize the optimal policy π∗ we partition B based on optimal actions.
I st ∈ {0, 1}, so bt has two components: bt(0) = P[st = 0 | ht] and bt(1) = P[st = 1 | ht]
I Since bt(0) + bt(1) = 1, bt is completely characterized by bt(1) (bt(0) = 1 − bt(1))
I Hence, B is the unit interval [0, 1]
I Stopping set S = {b(1) ∈ [0, 1] : π∗(b(1)) = S}
I Continue set C = {b(1) ∈ [0, 1] : π∗(b(1)) = C}
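
Since st ∈ {0, 1}, the belief bt(1) can be tracked with a standard two-state Bayes filter. The sketch below is a generic illustration (not the authors' implementation) under the assumptions that state 1 is absorbing within an episode, that the per-step intrusion-start probability is p, and that an observation likelihood f(o | s) is available.

```python
def belief_update(b1, o, obs_lik, p):
    """One Bayes-filter step for b_t(1) = P[s_t = 1 | h_t] in the two-state POMDP.

    b1:      previous belief b_{t-1}(1)
    o:       new observation o_t (e.g. the tuple (dx, dy, dz))
    obs_lik: function obs_lik(o, s) returning the observation likelihood f(o | s)
    p:       per-step probability that the intrusion starts (Q_t ~ Ber(p))
    """
    pred1 = b1 + (1.0 - b1) * p                  # prediction: 0 -> 1 with prob. p, 1 absorbing
    num = obs_lik(o, 1) * pred1                  # correction with the new observation
    den = num + obs_lik(o, 0) * (1.0 - pred1)
    return num / den if den > 0 else b1
```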
9/23
Threshold Properties of the Optimal Defender Policy
[Figure: the optimal policy π∗(b(1)) on the belief space B = [0, 1]; the action is "continue" for beliefs b(1) below the threshold α∗ (the continue set C) and "stop" for b(1) ≥ α∗ (the stopping set S)]
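
This threshold structure suggests a one-parameter defender policy: continue while b(1) < α∗ and stop as soon as b(1) ≥ α∗. A minimal sketch, reusing the belief_update function sketched above (the threshold α∗, the observation likelihood, and p are assumed given):

```python
def run_threshold_policy(observations, alpha_star, obs_lik, p, b1=0.0):
    """Return the stopping time of the threshold policy, or None if it never stops."""
    for t, o in enumerate(observations, start=1):
        b1 = belief_update(b1, o, obs_lik, p)    # Bayes filter sketched earlier
        if b1 >= alpha_star:                     # belief entered the stopping set S
            return t
    return None
```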
10/23
Our Method for Finding Effective Security Strategies
[Figure: method overview; a real-world infrastructure is replicated by an emulation system; model estimation from emulation measurements yields a POMDP model for a simulation system; reinforcement learning in the simulation produces policies that drive automation and self-learning systems]
11/23
Emulation System Σ Configuration Space
[Figure: emulated infrastructures, each instantiated from a configuration σi (example subnets 172.18.4.0/24, 172.18.19.0/24, 172.18.61.0/24)]
Emulation
A cluster of machines that runs a virtualized infrastructure
which replicates important functionality of target systems.
I The set of virtualized configurations defines a configuration space Σ = ⟨A, O, S, U, T, V⟩.
I A specific emulation is based on a configuration σi ∈ Σ.
12/23
From Emulation to Simulation: System Identification
[Figure: from the emulated network (router R1 and components m1, . . . , m7, each with an observation vector) via an abstract model to a POMDP model ⟨S, A, P, R, γ, O, Z⟩ with actions a1, a2, . . ., states s1, s2, . . ., and observations o1, o2, . . .]
I Abstract Model Based on Domain Knowledge: Models the set of controls, the objective function, and the features of the emulated network.
I Defines the static parts of the POMDP model.
I Dynamics Model (P, Z) Identified using System Identification: Algorithm based on random walks and maximum-likelihood estimation (a small counting sketch follows below):

M(b′ | b, a) ≜ n(b, a, b′) / Σ_{j′} n(b, a, j′)
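
The maximum-likelihood step reduces to counting observed transitions along the random walks and normalizing. A small illustrative sketch (not the authors' system-identification code):

```python
from collections import Counter, defaultdict

counts = defaultdict(Counter)          # counts[(b, a)][b_next] = n(b, a, b_next)

def record_transition(b, a, b_next):
    """Record one transition observed during a random walk in the emulation."""
    counts[(b, a)][b_next] += 1

def M_hat(b_next, b, a):
    """Maximum-likelihood estimate M(b_next | b, a) = n(b, a, b_next) / sum_j n(b, a, j)."""
    total = sum(counts[(b, a)].values())
    return counts[(b, a)][b_next] / total if total > 0 else 0.0
```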
13/23
System Identification: Estimated Empirical Distributions
[Figure: estimated empirical distributions f̂XYZ of the number of severe IDS alerts ∆x, warning IDS alerts ∆y, and login attempts ∆z, during intrusions by the Novice, Experienced, and Expert attackers and when there is no intrusion]
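
The empirical distributions above can be obtained by normalizing histograms of the counts recorded in the emulation. A brief, illustrative sketch (the bin choice and the data source are assumptions):

```python
import numpy as np

def empirical_pmf(samples, n_bins=100):
    """Estimate an empirical distribution f_hat from recorded alert/login counts."""
    counts, edges = np.histogram(samples, bins=n_bins)
    return counts / max(counts.sum(), 1), edges

# Usage sketch: `samples` would be the per-step counts (e.g. severe alerts) recorded
# in the emulation under a given attacker profile, or under no intrusion.
```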
15/23
Policy Optimization in the Simulation System
using Reinforcement Learning
I Goal:
I Approximate π∗ = arg max_π E[ Σ_{t=1}^{T} γ^{t−1} r_{t+1} ]
I Learning Algorithm:
I Represent π by πθ
I Define objective J(θ) = E_{πθ}[ Σ_{t=1}^{T} γ^{t−1} r(st, at) ]
I Maximize J(θ) by stochastic gradient ascent:

∇θJ(θ) = E_{πθ}[ ∇θ log πθ(a | h) · A^{πθ}(h, a) ]

where ∇θ log πθ(a | h) is the actor term and the advantage A^{πθ}(h, a) is estimated by the critic
I Method (a minimal code sketch follows below):
1. Simulate a series of POMDP episodes
2. Use episode outcomes and trajectories to estimate ∇θJ(θ)
3. Update policy πθ with stochastic gradient ascent
4. Continue until convergence
[Figure: agent-environment loop; the agent takes action at, the environment returns the next state st+1 and reward rt+1]
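
For illustration, the sketch below implements the gradient estimator with plain REINFORCE, using the discounted return in place of the critic's advantage estimate and a sigmoid stopping probability over a scalar feature of the history (e.g. the belief b(1)). It is a simplified stand-in for the actor-critic estimator above, not the training code used in the paper; all hyperparameters are assumed example values.

```python
import numpy as np

def pi_stop(theta, x):
    """Stopping probability pi_theta(stop | x) as a sigmoid of a scalar feature x."""
    return 1.0 / (1.0 + np.exp(-(theta[0] * x + theta[1])))

def reinforce_update(theta, episodes, lr=0.01, gamma=0.99):
    """One REINFORCE gradient-ascent step on J(theta).

    episodes: list of trajectories [(x_t, a_t, r_t), ...], with a_t = 1 for stop, 0 for continue.
    """
    grad = np.zeros_like(theta)
    for episode in episodes:
        G = 0.0
        for x, a, r in reversed(episode):
            G = r + gamma * G                     # discounted return, used instead of the critic
            p = pi_stop(theta, x)
            score = (a - p) * np.array([x, 1.0])  # grad of log pi_theta(a | x) for a Bernoulli policy
            grad += score * G
    return theta + lr * grad / max(len(episodes), 1)
```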
17/23
The Target Infrastructure
I Topology:
I 30 Application Servers, 1 Gateway/IDS (Snort), 3 Clients, 1 Attacker,
1 Defender
I Services
I 31 SSH, 8 HTTP, 1 DNS, 1 Telnet, 2 FTP, 1 MongoDB, 2 SMTP, 2
Teamspeak 3, 22 SNMP, 12 IRC, 1 Elasticsearch, 12 NTP, 1 Samba,
19 PostgreSQL
I RCE Vulnerabilities
I 1 CVE-2010-0426, 1 CVE-2014-6271, 1 SQL Injection, 1
CVE-2015-3306, 1 CVE-2016-10033, 1 CVE-2015-5602, 1
CVE-2015-1427, 1 CVE-2017-7494
I 5 Brute-force vulnerabilities
I Operating Systems
I 23 Ubuntu-20, 1 Debian 9.2, 1 Debian Wheezy, 6 Debian Jessie, 1 Kali
[Figure: the target infrastructure, with attacker, clients, defender, a gateway IDS generating alerts, and 31 networked components]
18/23
Emulating the Client Population
Client | Functions | Application servers
1 | HTTP, SSH, SNMP, ICMP | N2, N3, N10, N12
2 | IRC, PostgreSQL, SNMP | N31, N13, N14, N15, N16
3 | FTP, DNS, Telnet | N10, N22, N4
Table 1: Emulated client population; each client interacts with
application servers using a set of functions at short intervals.
19/23
Emulating the Defender’s Actions
Action | Command in the Emulation
Stop | iptables -A INPUT -i eth0 -j DROP
Table 2: Command used to implement the defender’s stop action.
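
For completeness, a hedged sketch of how a defender agent could issue the stop command from Table 2 programmatically (the interface name eth0 comes from the table; running with sufficient privileges inside the emulation is an assumption):

```python
import subprocess

def execute_stop_action():
    """Apply the defender's final stop: drop all incoming traffic on eth0 (Table 2)."""
    cmd = ["iptables", "-A", "INPUT", "-i", "eth0", "-j", "DROP"]
    subprocess.run(cmd, check=True)
```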
20/23
Emulating the Attacker’s Actions
Time-steps t | Actions
1 – It ∼ Ge(0.2) | (Intrusion has not started)
It + 1 – It + 7 | Recon, brute-force attacks (SSH, Telnet, FTP) on N2, N4, N10, login(N2, N4, N10), backdoor(N2, N4, N10), Recon
It + 8 – It + 11 | CVE-2014-6271 on N17, SSH brute-force attack on N12, login(N17, N12), backdoor(N17, N12)
It + 12 – It + 16 | CVE-2010-0426 exploit on N12, Recon, SQL-Injection on N18, login(N18), backdoor(N18)
It + 17 – It + 22 | Recon, CVE-2015-1427 on N25, login(N25), Recon, CVE-2017-7494 exploit on N27, login(N27)
Table 3: Attacker actions to emulate an intrusion.
21/23
Learning Intrusion Prevention Policies through Optimal
Stopping
[Figure: learning curves versus the number of policy updates: reward per episode, episode length (steps), P[intrusion interrupted], P[early stopping], and duration of uninterrupted intrusion, for the learned policy πθ against the NoisyAttacker and the StealthyAttacker, compared with the t = 6 baseline, the (x + y) ≥ 1 baseline, and the upper bound π∗]
Learning curves of training defender policies against static attackers.
22/23
Threshold Properties of the Learned Policies
[Figure: stopping probabilities of the learned policies: πθ(stop | x + y) versus the total number of alerts, πθ(stop | z) versus the number of login attempts, and πθ(stop | x, y) surfaces over severe alerts x and warning alerts y, against the StealthyAttacker and the NoisyAttacker; the surfaces show soft thresholds and regions where the policy never stops]
23/23
Conclusions & Future Work
I Conclusions:
I We develop a method to learn intrusion prevention policies:
I (1) emulation system; (2) system identification; (3) simulation system; (4) reinforcement learning; and (5) domain randomization and generalization.
I We formulate intrusion prevention as an optimal stopping problem
I We present a POMDP model of the use case
I We apply optimal stopping theory to establish structural results for the optimal policy
I Our research plans:
I Extending the theoretical model
I Relaxing simplifying assumptions (e.g. more dynamic defender actions)
I Active attacker
I Multiple stops
I Evaluation on real world infrastructures

Weitere ähnliche Inhalte

Ähnlich wie Learning Intrusion Prevention Policies through Optimal Stopping - CNSM2021

Self-Learning Systems for Cyber Security
Self-Learning Systems for Cyber SecuritySelf-Learning Systems for Cyber Security
Self-Learning Systems for Cyber SecurityKim Hammar
 
Self-learning systems for cyber security
Self-learning systems for cyber securitySelf-learning systems for cyber security
Self-learning systems for cyber securityKim Hammar
 
Intrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal StoppingIntrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal StoppingKim Hammar
 
Self-learning Intrusion Prevention Systems.
Self-learning Intrusion Prevention Systems.Self-learning Intrusion Prevention Systems.
Self-learning Intrusion Prevention Systems.Kim Hammar
 
Security challenges in Ethereum smart contract programming (ver. 2)
Security challenges in Ethereum smart contract programming (ver. 2)Security challenges in Ethereum smart contract programming (ver. 2)
Security challenges in Ethereum smart contract programming (ver. 2)Sergei Tikhomirov
 
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...Kim Hammar
 
Coco co-desing and co-verification of masked software implementations on cp us
Coco   co-desing and co-verification of masked software implementations on cp usCoco   co-desing and co-verification of masked software implementations on cp us
Coco co-desing and co-verification of masked software implementations on cp usRISC-V International
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data SciencePaolo Missier
 
Security challenges in Ethereum smart contract programming
Security challenges in Ethereum smart contract programmingSecurity challenges in Ethereum smart contract programming
Security challenges in Ethereum smart contract programmingSergei Tikhomirov
 
Lecture #8: Clark-Wilson & Chinese Wall Model for Multilevel Security
Lecture #8: Clark-Wilson & Chinese Wall Model for Multilevel SecurityLecture #8: Clark-Wilson & Chinese Wall Model for Multilevel Security
Lecture #8: Clark-Wilson & Chinese Wall Model for Multilevel SecurityDr. Ramchandra Mangrulkar
 
Nse seminar 4_dec_hammar_stadler
Nse seminar 4_dec_hammar_stadlerNse seminar 4_dec_hammar_stadler
Nse seminar 4_dec_hammar_stadlerKim Hammar
 
Huntpedia
HuntpediaHuntpedia
HuntpediaJc Sv
 
[Bucharest] Attack is easy, let's talk defence
[Bucharest] Attack is easy, let's talk defence[Bucharest] Attack is easy, let's talk defence
[Bucharest] Attack is easy, let's talk defenceOWASP EEE
 
Presentatie professor Hartel Dialogues House, 28 mrt 2012
Presentatie professor Hartel Dialogues House, 28 mrt 2012Presentatie professor Hartel Dialogues House, 28 mrt 2012
Presentatie professor Hartel Dialogues House, 28 mrt 2012thesocialreporters
 
Towards an authority-free marketplace for personal IoT data (Personal Data fr...
Towards an authority-free marketplace for personal IoT data (Personal Data fr...Towards an authority-free marketplace for personal IoT data (Personal Data fr...
Towards an authority-free marketplace for personal IoT data (Personal Data fr...Paolo Missier
 

Ähnlich wie Learning Intrusion Prevention Policies through Optimal Stopping - CNSM2021 (20)

Self-Learning Systems for Cyber Security
Self-Learning Systems for Cyber SecuritySelf-Learning Systems for Cyber Security
Self-Learning Systems for Cyber Security
 
Self-learning systems for cyber security
Self-learning systems for cyber securitySelf-learning systems for cyber security
Self-learning systems for cyber security
 
Intrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal StoppingIntrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal Stopping
 
Self-learning Intrusion Prevention Systems.
Self-learning Intrusion Prevention Systems.Self-learning Intrusion Prevention Systems.
Self-learning Intrusion Prevention Systems.
 
Security challenges in Ethereum smart contract programming (ver. 2)
Security challenges in Ethereum smart contract programming (ver. 2)Security challenges in Ethereum smart contract programming (ver. 2)
Security challenges in Ethereum smart contract programming (ver. 2)
 
The red book
The red book  The red book
The red book
 
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
 
Coco co-desing and co-verification of masked software implementations on cp us
Coco   co-desing and co-verification of masked software implementations on cp usCoco   co-desing and co-verification of masked software implementations on cp us
Coco co-desing and co-verification of masked software implementations on cp us
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data Science
 
Quantum Safety in Certified Cryptographic Modules
Quantum Safety in Certified Cryptographic ModulesQuantum Safety in Certified Cryptographic Modules
Quantum Safety in Certified Cryptographic Modules
 
Security challenges in Ethereum smart contract programming
Security challenges in Ethereum smart contract programmingSecurity challenges in Ethereum smart contract programming
Security challenges in Ethereum smart contract programming
 
Lecture #8: Clark-Wilson & Chinese Wall Model for Multilevel Security
Lecture #8: Clark-Wilson & Chinese Wall Model for Multilevel SecurityLecture #8: Clark-Wilson & Chinese Wall Model for Multilevel Security
Lecture #8: Clark-Wilson & Chinese Wall Model for Multilevel Security
 
Computing security
Computing securityComputing security
Computing security
 
Nse seminar 4_dec_hammar_stadler
Nse seminar 4_dec_hammar_stadlerNse seminar 4_dec_hammar_stadler
Nse seminar 4_dec_hammar_stadler
 
Huntpedia
HuntpediaHuntpedia
Huntpedia
 
Nt1330 Unit 4 Dthm Paper
Nt1330 Unit 4 Dthm PaperNt1330 Unit 4 Dthm Paper
Nt1330 Unit 4 Dthm Paper
 
[Bucharest] Attack is easy, let's talk defence
[Bucharest] Attack is easy, let's talk defence[Bucharest] Attack is easy, let's talk defence
[Bucharest] Attack is easy, let's talk defence
 
How secure are your systems
How secure are your systemsHow secure are your systems
How secure are your systems
 
Presentatie professor Hartel Dialogues House, 28 mrt 2012
Presentatie professor Hartel Dialogues House, 28 mrt 2012Presentatie professor Hartel Dialogues House, 28 mrt 2012
Presentatie professor Hartel Dialogues House, 28 mrt 2012
 
Towards an authority-free marketplace for personal IoT data (Personal Data fr...
Towards an authority-free marketplace for personal IoT data (Personal Data fr...Towards an authority-free marketplace for personal IoT data (Personal Data fr...
Towards an authority-free marketplace for personal IoT data (Personal Data fr...
 

Mehr von Kim Hammar

Automated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jecturesAutomated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jecturesKim Hammar
 
Självlärande System för Cybersäkerhet. KTH
Självlärande System för Cybersäkerhet. KTHSjälvlärande System för Cybersäkerhet. KTH
Självlärande System för Cybersäkerhet. KTHKim Hammar
 
Learning Automated Intrusion Response
Learning Automated Intrusion ResponseLearning Automated Intrusion Response
Learning Automated Intrusion ResponseKim Hammar
 
Intrusion Tolerance for Networked Systems through Two-level Feedback Control
Intrusion Tolerance for Networked Systems through Two-level Feedback ControlIntrusion Tolerance for Networked Systems through Two-level Feedback Control
Intrusion Tolerance for Networked Systems through Two-level Feedback ControlKim Hammar
 
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...Kim Hammar
 
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Kim Hammar
 
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Kim Hammar
 
Learning Optimal Intrusion Responses via Decomposition
Learning Optimal Intrusion Responses via DecompositionLearning Optimal Intrusion Responses via Decomposition
Learning Optimal Intrusion Responses via DecompositionKim Hammar
 
Digital Twins for Security Automation
Digital Twins for Security AutomationDigital Twins for Security Automation
Digital Twins for Security AutomationKim Hammar
 
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...Kim Hammar
 
Självlärande system för cyberförsvar.
Självlärande system för cyberförsvar.Självlärande system för cyberförsvar.
Självlärande system för cyberförsvar.Kim Hammar
 
Self-Learning Systems for Cyber Defense
Self-Learning Systems for Cyber DefenseSelf-Learning Systems for Cyber Defense
Self-Learning Systems for Cyber DefenseKim Hammar
 
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.Kim Hammar
 
Intrusion Prevention through Optimal Stopping.
Intrusion Prevention through Optimal Stopping.Intrusion Prevention through Optimal Stopping.
Intrusion Prevention through Optimal Stopping.Kim Hammar
 
A Game Theoretic Analysis of Intrusion Detection in Access Control Systems - ...
A Game Theoretic Analysis of Intrusion Detection in Access Control Systems - ...A Game Theoretic Analysis of Intrusion Detection in Access Control Systems - ...
A Game Theoretic Analysis of Intrusion Detection in Access Control Systems - ...Kim Hammar
 
Självlärande system för cybersäkerhet
Självlärande system för cybersäkerhetSjälvlärande system för cybersäkerhet
Självlärande system för cybersäkerhetKim Hammar
 
MuZero - ML + Security Reading Group
MuZero - ML + Security Reading GroupMuZero - ML + Security Reading Group
MuZero - ML + Security Reading GroupKim Hammar
 

Mehr von Kim Hammar (17)

Automated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jecturesAutomated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jectures
 
Självlärande System för Cybersäkerhet. KTH
Självlärande System för Cybersäkerhet. KTHSjälvlärande System för Cybersäkerhet. KTH
Självlärande System för Cybersäkerhet. KTH
 
Learning Automated Intrusion Response
Learning Automated Intrusion ResponseLearning Automated Intrusion Response
Learning Automated Intrusion Response
 
Intrusion Tolerance for Networked Systems through Two-level Feedback Control
Intrusion Tolerance for Networked Systems through Two-level Feedback ControlIntrusion Tolerance for Networked Systems through Two-level Feedback Control
Intrusion Tolerance for Networked Systems through Two-level Feedback Control
 
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
 
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
 
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
 
Learning Optimal Intrusion Responses via Decomposition
Learning Optimal Intrusion Responses via DecompositionLearning Optimal Intrusion Responses via Decomposition
Learning Optimal Intrusion Responses via Decomposition
 
Digital Twins for Security Automation
Digital Twins for Security AutomationDigital Twins for Security Automation
Digital Twins for Security Automation
 
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
 
Självlärande system för cyberförsvar.
Självlärande system för cyberförsvar.Självlärande system för cyberförsvar.
Självlärande system för cyberförsvar.
 
Self-Learning Systems for Cyber Defense
Self-Learning Systems for Cyber DefenseSelf-Learning Systems for Cyber Defense
Self-Learning Systems for Cyber Defense
 
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
 
Intrusion Prevention through Optimal Stopping.
Intrusion Prevention through Optimal Stopping.Intrusion Prevention through Optimal Stopping.
Intrusion Prevention through Optimal Stopping.
 
A Game Theoretic Analysis of Intrusion Detection in Access Control Systems - ...
A Game Theoretic Analysis of Intrusion Detection in Access Control Systems - ...A Game Theoretic Analysis of Intrusion Detection in Access Control Systems - ...
A Game Theoretic Analysis of Intrusion Detection in Access Control Systems - ...
 
Självlärande system för cybersäkerhet
Självlärande system för cybersäkerhetSjälvlärande system för cybersäkerhet
Självlärande system för cybersäkerhet
 
MuZero - ML + Security Reading Group
MuZero - ML + Security Reading GroupMuZero - ML + Security Reading Group
MuZero - ML + Security Reading Group
 

Kürzlich hochgeladen

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 

Kürzlich hochgeladen (20)

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides

Learning Intrusion Prevention Policies through Optimal Stopping - CNSM2021

  • 8–10. 5/23 Background on Optimal Stopping Problems
    - The general problem:
      - A Markov process (s_t), t = 1, ..., T, is observed sequentially
      - Two options per t: (i) continue to observe; or (ii) stop
    - History:
      - Studied in the 18th century to analyze a gambler's fortune
      - Formalized by Abraham Wald in 1947 [2]
      - Since then generalized and developed by Chow et al. [3], Shiryaev & Kolmogorov [4], Bather [5], Bertsekas [6], and others
    - Applications & use cases:
      - Change detection [7], selling decisions [8], queue management [9], advertisement scheduling [10], intrusion prevention [11], etc. (a worked backward-induction example follows after the references)

    References:
    [2] Abraham Wald. Sequential Analysis. Wiley and Sons, New York, 1947.
    [3] Y. Chow, H. Robbins, and D. Siegmund. "Great expectations: The theory of optimal stopping". 1971.
    [4] Albert N. Shiryaev. Optimal Stopping Rules. Reprint of the Russian edition from 1969. Springer-Verlag Berlin, 2007.
    [5] John Bather. Decision Theory: An Introduction to Dynamic Programming and Sequential Decisions. John Wiley and Sons, 2000. ISBN: 0471976490.
    [6] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. 3rd ed. Vol. I. Belmont, MA: Athena Scientific.
    [7] Alexander G. Tartakovsky et al. "Detection of intrusions in information systems by sequential change-point methods". Statistical Methodology 3.3 (2006). doi: 10.1016/j.stamet.2005.05.003.
    [8] Jacques du Toit and Goran Peskir. "Selling a stock at the ultimate maximum". The Annals of Applied Probability 19.3 (2009). doi: 10.1214/08-AAP566.
    [9] Arghyadip Roy et al. "Online Reinforcement Learning of Optimal Threshold Policies for Markov Decision Processes". CoRR (2019). arXiv: 1912.10325.
    [10] Vikram Krishnamurthy, Anup Aprem, and Sujay Bhatt. "Multiple stopping time POMDPs: Structural results & application in interactive advertising on social media". Automatica 95 (2018), pp. 385–398. doi: 10.1016/j.automatica.2018.06.013.
    [11] Kim Hammar and Rolf Stadler. Learning Intrusion Prevention Policies through Optimal Stopping. 2021. arXiv: 2106.07160 [cs.AI].
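  To make the general stopping problem concrete, here is a minimal sketch (not taken from the talk) that solves a small finite-horizon stopping problem by backward induction. The two-state chain, rewards, and discount factor are illustrative assumptions.

```python
# Minimal sketch: backward induction for a finite-horizon optimal stopping
# problem over a small Markov chain. All numbers are illustrative assumptions.
import numpy as np

T = 20            # horizon
gamma = 0.99      # discount factor
P = np.array([[0.8, 0.2],    # transition matrix of the observed chain
              [0.0, 1.0]])   # state 1 is absorbing
R_C = np.array([1.0, -2.0])  # expected reward for continuing in state s
R_S = np.array([-5.0, 10.0]) # reward collected when stopping in state s

V = np.zeros(2)                        # value after the horizon
policy = np.zeros((T, 2), dtype=int)   # 0 = continue, 1 = stop
for t in reversed(range(T)):
    q_continue = R_C + gamma * P @ V   # value of observing one more step
    q_stop = R_S                       # stopping ends the episode
    policy[t] = (q_stop >= q_continue).astype(int)
    V = np.maximum(q_stop, q_continue)

print("optimal action per (t, state):")
print(policy)
```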
  • 11–17. 6/23 Formulating Intrusion Prevention as a Stopping Problem
    - The system evolves in discrete time-steps.
    - The defender observes the infrastructure (IDS, log files, etc.).
    - An intrusion occurs at an unknown time.
    - The defender can make L stops.
    - Each stop is associated with a defensive action.
    - The final stop shuts down the infrastructure.
    - Based on the observations, when is it optimal to stop?
    - We formalize this problem with a POMDP.
    (Figures: episode timeline from t = 1 to t = T showing the intrusion event, early stopping times, and stopping times that affect the intrusion; network topology with attacker, clients, defender, and gateway IDS.)
  • 18–25. 7/23 A Partially Observed MDP Model for the Defender
    - States: intrusion state s_t ∈ {0, 1}; terminal state ∅.
    - Observations: severe/warning IDS alerts (∆x, ∆y) and login attempts ∆z, distributed according to f_XYZ(∆x, ∆y, ∆z | s_t, I_t, t).
    - Actions: "Stop" (S) and "Continue" (C).
    - Rewards: reward for security and service; penalty for false alarms and intrusions.
    - Transition probabilities: a Bernoulli process (Q_t), t = 1, ..., T, with Q_t ∼ Ber(p) defines the intrusion start I_t ∼ Ge(p).
    - Objective and horizon: maximize E_πθ[ Σ_{t=1}^{T∅} r(s_t, a_t) ], where T∅ is the time at which the terminal state is reached.
    - We analyze the optimal policy using optimal stopping theory. (A simulation sketch of these dynamics follows below.)
    (Figures: state diagram over 0, 1, ∅: the intrusion starts when Q_t = 1, a stop leads to ∅, and the intrusion succeeds when t ≥ I_t + T_int; CDF of the intrusion start time I_t ∼ Ge(p = 0.2).)
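  The following sketch simulates the episode dynamics described above under stated assumptions: the intrusion start is geometric as in the model, but the observation distributions are placeholder Poisson rates rather than the empirically estimated f_XYZ.

```python
# Minimal sketch (assumptions, not the paper's measured model): simulating
# one episode of the intrusion POMDP. The Poisson rates below are placeholders.
import numpy as np

rng = np.random.default_rng(0)
p = 0.2         # per-step probability that the intrusion starts (Q_t ~ Ber(p))
T_max = 100     # safety bound on the episode length

def sample_obs(s, rng):
    """Sample (severe alerts, warning alerts, login attempts) given the state."""
    if s == 0:                             # no intrusion: low rates (assumed)
        return rng.poisson([5, 10, 1])
    return rng.poisson([60, 40, 15])       # ongoing intrusion: elevated rates (assumed)

s = 0                                      # 0 = no intrusion, 1 = intrusion
for t in range(1, T_max + 1):
    if s == 0 and rng.random() < p:        # Q_t = 1: the intrusion starts
        s = 1
    dx, dy, dz = sample_obs(s, rng)
    # a defender policy would map the observation history to "stop"/"continue" here
    print(f"t={t:3d} s={s} severe={dx:3d} warning={dy:3d} logins={dz:2d}")
    if t > 20:                             # truncate the printout for the example
        break
```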
  • 26–31. 8/23 Belief States and Stopping Set
    - Belief states:
      - The belief state b_t ∈ B is defined as b_t(s_t) = P[s_t | h_t]
      - b_t is a sufficient statistic of s_t based on the history h_t = (ρ_1, a_1, o_1, ..., a_{t−1}, o_t) ∈ H
      - B is the unit (|S| − 1)-simplex
    - Characterizing the optimal policy π∗:
      - To characterize π∗ we partition B based on the optimal actions.
      - Since s_t ∈ {0, 1}, b_t has two components: b_t(0) = P[s_t = 0 | h_t] and b_t(1) = P[s_t = 1 | h_t]
      - Because b_t(0) + b_t(1) = 1, b_t is completely characterized by b_t(1), with b_t(0) = 1 − b_t(1)
      - Hence B is the unit interval [0, 1]
      - Stopping set S = {b(1) ∈ [0, 1] : π∗(b(1)) = S}
      - Continue set C = {b(1) ∈ [0, 1] : π∗(b(1)) = C}
      (A belief-update sketch follows below.)
    (Figure: the 1-dimensional unit simplex B(2) and the 2-dimensional unit simplex B(3).)
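  Because the belief reduces to the scalar b_t(1), the defender can track it with a simple Bayesian filter. The sketch below assumes a persistent intrusion state and made-up Poisson alert likelihoods; it illustrates the belief update rather than the paper's implementation.

```python
# Minimal sketch (assumed likelihoods, not the estimated f_XYZ): Bayesian
# belief update for the two-state intrusion POMDP. Since b_t(0) = 1 - b_t(1),
# it suffices to track b_t(1) = P[intrusion | history].
from math import exp, factorial

def belief_update(b1, obs, p, lik_intrusion, lik_no_intrusion):
    """One-step update of b(1) after action 'continue' and observation obs.

    b1:               current belief P[s_t = 1 | h_t]
    p:                per-step probability that the intrusion starts
    lik_intrusion:    function obs -> P[obs | s = 1]
    lik_no_intrusion: function obs -> P[obs | s = 0]
    """
    pred1 = b1 + (1.0 - b1) * p       # prediction: the intrusion persists once started
    pred0 = 1.0 - pred1
    num1 = lik_intrusion(obs) * pred1     # Bayes rule with the observation likelihoods
    num0 = lik_no_intrusion(obs) * pred0
    return num1 / (num1 + num0)           # normalize

# toy usage with made-up Poisson alert likelihoods
poisson = lambda lam: (lambda k: exp(-lam) * lam**k / factorial(k))
b1 = 0.0
for alerts in [3, 4, 20, 35]:             # an assumed sequence of severe-alert counts
    b1 = belief_update(b1, alerts, p=0.2,
                       lik_intrusion=poisson(25), lik_no_intrusion=poisson(5))
    print(f"alerts={alerts:2d} -> b(1)={b1:.3f}")
```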
  • 32–34. 9/23 Threshold Properties of the Optimal Defender Policy
    (Figure: the belief space B = [0, 1] is split by a threshold α∗; the optimal policy π∗(b(1)) continues for beliefs below α∗, the continue set C, and stops for beliefs above α∗, the stopping set S. A threshold-policy sketch follows below.)
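  Given the belief b(1), the structural result suggests a stopping rule of the following threshold form; the numeric threshold in this sketch is an arbitrary illustrative choice, not a derived optimum.

```python
# Minimal sketch: a threshold stopping policy over the belief b(1).
# The threshold value 0.75 is an illustrative assumption.
def threshold_policy(b1, alpha=0.75):
    """Return 'stop' if the belief of an ongoing intrusion reaches alpha."""
    return "stop" if b1 >= alpha else "continue"

print(threshold_policy(0.4))   # continue
print(threshold_policy(0.9))   # stop
```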
  • 35–36. 10/23 Our Method for Finding Effective Security Strategies
    (Figure: method overview: a real-world infrastructure is replicated by an emulation system; model estimation yields a POMDP model; the simulation system runs reinforcement learning; the learned policies feed self-learning systems and automation.)
  • 37–38. 11/23 Emulation System
    - Emulation: a cluster of machines that runs a virtualized infrastructure which replicates important functionality of target systems.
    - The set of virtualized configurations defines a configuration space Σ = ⟨A, O, S, U, T, V⟩.
    - A specific emulation is based on a configuration σ_i ∈ Σ.
    (Figure: configuration space Σ with configurations σ_i instantiated as emulated infrastructures, e.g. subnets 172.18.4.0/24, 172.18.19.0/24, and 172.18.61.0/24.)
  • 39–41. 12/23 From Emulation to Simulation: System Identification
    - Abstract model based on domain knowledge: models the set of controls, the objective function, and the features of the emulated network; defines the static parts of the POMDP model.
    - Dynamics model (P, Z) identified using system identification: an algorithm based on random walks and maximum-likelihood estimation, M(b′ | b, a) ≜ n(b, a, b′) / Σ_{j′} n(b, a, j′), where n(b, a, b′) counts the observed transitions. (A count-based estimation sketch follows below.)
    (Figure: emulated network with per-node measurements m_{i,j} mapped to an abstract model and then to the POMDP model ⟨S, A, P, R, γ, O, Z⟩.)
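  A count-based maximum-likelihood estimator of the kind referred to above can be sketched as follows; the state and observation names are hypothetical, and the code only illustrates the relative-frequency estimate M(b′ | b, a) = n(b, a, b′) / Σ_{j′} n(b, a, j′).

```python
# Minimal sketch (an assumed illustration, not the paper's implementation):
# maximum-likelihood estimation of dynamics from transition counts collected
# by running random walks in the emulation.
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # (state, action) -> successor -> count

def record_transition(b, a, b_next):
    """Record one transition observed in the emulation."""
    counts[(b, a)][b_next] += 1

def ml_estimate(b, a):
    """Return the estimated distribution over successors of (b, a)."""
    successors = counts[(b, a)]
    total = sum(successors.values())
    return {j: n / total for j, n in successors.items()}

# toy usage with illustrative (hypothetical) state names
for b_next in ["low_alerts", "low_alerts", "high_alerts"]:
    record_transition("no_intrusion", "continue", b_next)
print(ml_estimate("no_intrusion", "continue"))
# -> {'low_alerts': 0.666..., 'high_alerts': 0.333...}
```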
  • 42. 13/23 System Identification: Estimated Empirical Distributions
    (Figure: estimated empirical distributions f̂_XYZ of the number of severe IDS alerts ∆x, warning IDS alerts ∆y, and login attempts ∆z, under no intrusion and against Novice, Experienced, and Expert attackers.)
  • 43. 14/23 Our Method for Finding Effective Security Strategies (the method overview figure, repeated as a recap before the reinforcement-learning step).
  • 44–47. 15/23 Policy Optimization in the Simulation System using Reinforcement Learning
    - Goal: approximate π∗ = arg max_π E[ Σ_{t=1}^{T} γ^{t−1} r_{t+1} ]
    - Learning algorithm:
      - Represent π by a parameterized policy πθ
      - Define the objective J(θ) = E_πθ[ Σ_{t=1}^{T} γ^{t−1} r(s_t, a_t) ]
      - Maximize J(θ) by stochastic gradient ascent: ∇θ J(θ) = E_πθ[ ∇θ log πθ(a | h) · A^{πθ}(h, a) ], where ∇θ log πθ(a | h) is the actor term and the advantage A^{πθ}(h, a) is estimated by the critic
    - Method:
      1. Simulate a series of POMDP episodes
      2. Use episode outcomes and trajectories to estimate ∇θ J(θ)
      3. Update the policy πθ with stochastic gradient ascent
      4. Continue until convergence
      (A REINFORCE-style sketch follows below.)
    (Figure: agent-environment loop with action a_t, next state s_{t+1}, and reward r_{t+1}.)
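  As an illustration of steps 2 and 3 (estimating ∇θ J(θ) from simulated episodes and ascending the gradient), the sketch below uses a plain REINFORCE estimator with a sigmoid stopping policy over a single alert feature; the environment, features, and rewards are simplified assumptions and differ from the actor-critic setup on the slide.

```python
# Minimal sketch (illustrative assumptions, not the paper's training setup):
# REINFORCE-style policy gradient for a sigmoid stopping policy over a noisy
# alert count, trained on a simplified intrusion environment.
import numpy as np

rng = np.random.default_rng(1)

def simulate_episode(theta, p=0.2, T=30):
    """Roll out one episode; return per-step log-prob gradients and rewards."""
    grads, rewards, s = [], [], 0
    for t in range(T):
        if s == 0 and rng.random() < p:
            s = 1                               # intrusion starts
        x = rng.poisson(5 if s == 0 else 50)    # observed alert count (assumed rates)
        feat = np.array([1.0, x / 50.0])        # bias + scaled feature
        p_stop = 1.0 / (1.0 + np.exp(-feat @ theta))
        stop = rng.random() < p_stop
        # gradient of log pi(a | x) for a Bernoulli(sigmoid) policy
        grads.append(feat * ((1.0 - p_stop) if stop else -p_stop))
        if stop:
            rewards.append(10.0 if s == 1 else -10.0)   # reward stopping real intrusions
            return grads, rewards
        rewards.append(0.5 if s == 0 else -2.0)         # service reward vs. intrusion cost
    return grads, rewards

theta = np.zeros(2)
for it in range(2000):                           # stochastic gradient ascent on J(theta)
    grads, rewards = simulate_episode(theta)
    returns = np.cumsum(rewards[::-1])[::-1]     # reward-to-go for each step
    grad_J = sum(g * R for g, R in zip(grads, returns))
    theta += 0.01 * grad_J
print("learned parameters:", theta)
```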
  • 48. 16/23 Our Method for Finding Effective Security Strategies (the method overview figure, repeated as a recap before the evaluation on the target infrastructure).
  • 49. 17/23 The Target Infrastructure
    - Topology: 30 application servers, 1 gateway/IDS (Snort), 3 clients, 1 attacker, 1 defender
    - Services: 31 SSH, 8 HTTP, 1 DNS, 1 Telnet, 2 FTP, 1 MongoDB, 2 SMTP, 2 Teamspeak 3, 22 SNMP, 12 IRC, 1 Elasticsearch, 12 NTP, 1 Samba, 19 PostgreSQL
    - RCE vulnerabilities: 1 CVE-2010-0426, 1 CVE-2014-6271, 1 SQL injection, 1 CVE-2015-3306, 1 CVE-2016-10033, 1 CVE-2015-5602, 1 CVE-2015-1427, 1 CVE-2017-7494; 5 brute-force vulnerabilities
    - Operating systems: 23 Ubuntu 20, 1 Debian 9.2, 1 Debian Wheezy, 6 Debian Jessie, 1 Kali
    (Figure: target infrastructure topology.)
  • 50. 18/23 Emulating the Client Population

    Client | Functions              | Application servers
    -------|------------------------|-------------------------
    1      | HTTP, SSH, SNMP, ICMP  | N2, N3, N10, N12
    2      | IRC, PostgreSQL, SNMP  | N31, N13, N14, N15, N16
    3      | FTP, DNS, Telnet       | N10, N22, N4

    Table 1: Emulated client population; each client interacts with application servers using a set of functions at short intervals.
  • 51. 19/23 Emulating the Defender’s Actions

    Action | Command in the Emulation
    -------|----------------------------------
    Stop   | iptables -A INPUT -i eth0 -j DROP

    Table 2: Command used to implement the defender’s stop action.
  • 52. 20/23 Emulating the Attacker’s Actions

    Time-steps t         | Actions
    ---------------------|----------------------------------------------------------------
    1 – I_t ∼ Ge(0.2)    | (Intrusion has not started)
    I_t+1 – I_t+7        | Recon; brute-force attacks (SSH, Telnet, FTP) on N2, N4, N10; login(N2, N4, N10); backdoor(N2, N4, N10); Recon
    I_t+8 – I_t+11       | CVE-2014-6271 on N17; SSH brute-force attack on N12; login(N17, N12); backdoor(N17, N12)
    I_t+12 – I_t+16      | CVE-2010-0426 exploit on N12; Recon; SQL injection on N18; login(N18); backdoor(N18)
    I_t+17 – I_t+22      | Recon; CVE-2015-1427 on N25; login(N25); Recon; CVE-2017-7494 exploit on N27; login(N27)

    Table 3: Attacker actions used to emulate an intrusion.
  • 53. 21/23 Learning Intrusion Prevention Policies through Optimal Stopping
    (Figure: learning curves when training defender policies against static attackers; panels show reward per episode, episode length, P[intrusion interrupted], P[early stopping], and the uninterrupted intrusion time as functions of the number of policy updates; the learned policy πθ against the NoisyAttacker and StealthyAttacker is compared with a t = 6 baseline, an (x + y) ≥ 1 baseline, and the upper bound π∗.)
  • 54. 22/23 Threshold Properties of the Learned Policies
    (Figure: the learned stopping probability πθ(stop | ·) as a function of the total alert count x + y, the number of login attempts z, and the pair of severe alerts x and warning alerts y, against the StealthyAttacker and the NoisyAttacker; the learned policies exhibit soft thresholds and regions where they never stop.)
  • 55. 23/23 Conclusions & Future Work
    - Conclusions:
      - We develop a method to learn intrusion prevention policies: (1) emulation system; (2) system identification; (3) simulation system; (4) reinforcement learning; and (5) domain randomization and generalization.
      - We formulate intrusion prevention as an optimal stopping problem.
      - We present a POMDP model of the use case.
      - We apply optimal stopping theory to establish structural results of the optimal policy.
    - Our research plans:
      - Extending the theoretical model
      - Relaxing simplifying assumptions (e.g. more dynamic defender actions)
      - Active attacker
      - Multiple stops
      - Evaluation on real-world infrastructures