Slides for our paper: Ting Chen, Lu-An Tang, Yizhou Sun, Zhengzhang Chen, Kai Zhang, "Entity Embedding-based Anomaly Detection for Heterogeneous Categorical Events," Proc. 25th Int. Joint Conf. on Artificial Intelligence (IJCAI'16), New York City, USA, Jul. 2016.
1. Entity Embedding-based Anomaly Detection for Heterogeneous Categorical Events

Ting Chen♠¹, Lu-An Tang♣, Yizhou Sun♠, Zhengzhang Chen♣, Kai Zhang♣
♠Northeastern University, ♣NEC Labs America
{tingchen, yzsun}@ccs.neu.edu, {ltang, zchen, kzhang}@nec-labs.com
July 14, 2016

¹ Part of this work was done during the first author's internship at NEC Labs America.
2. Introduction

Anomaly detection is important in many applications, for example, detecting anomalous activities in enterprise networks.
3. Problem statement

Heterogeneous Categorical Event. A heterogeneous categorical event e = (a1, · · · , am) is a record containing m different categorical attributes, where the i-th attribute value ai denotes an entity of type Ai.

Table 1: Examples of events in a process-to-process interaction network.

index | day | hour | uid | src. proc. | dst. proc. | src. folder | dst. folder
0     | 3   | 16   | 8   | init       | python     | /sbin/      | /usr/bin/
1     | 5   | 3    | 4   | init       | mingetty   | /sbin/      | /sbin/

Problem: abnormal event detection. Given a set of training events D = {e1, · · · , en}, and assuming that most events in D are normal, the problem is to learn a model M so that when a new event en+1 arrives, M can accurately predict whether the event is abnormal.
4. Traditional anomaly detection by density estimation

Estimate the probability distribution over the data space using kernel density estimation, and detect anomalies/outliers as points with low probability/density.
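As an illustration of this traditional approach, here is a minimal sketch on synthetic numeric data (KDE does not directly apply to categorical events, which is the paper's point). The 1st-percentile density cutoff is an arbitrary illustrative choice, not from the paper:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(1000, 2))  # mostly "normal" events

kde = gaussian_kde(train.T)                   # fit KDE on training data
threshold = np.percentile(kde(train.T), 1)    # low-density cutoff (1st pct.)

def is_anomaly(points):
    """Flag points whose estimated density falls below the cutoff."""
    return kde(np.atleast_2d(points).T) < threshold

is_anomaly([0.0, 0.0])   # near the mode: dense, not flagged
is_anomaly([8.0, 8.0])   # far from the data: sparse, flagged
```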
5. Challenges

There are several challenges associated with heterogeneous categorical events:

Large event space: with m different entity types, we face an O(exp(m)) event space.
Lack of intrinsic distance measure among entities and events: what is the similarity between two users/machines? Between two events with different entities?
No labeled data.
6. Motivations for our model

To overcome the lack of distance measures: use entity embedding.
To alleviate the large event space issue:
  At the model level, consider only pairwise interactions. For example:
    A maintenance program is usually triggered at midnight, but suddenly it is triggered during the day.
    A user usually connects to servers with low privilege, but suddenly it tries to access some server with high privilege.
  At the learning level, propose noise-contrastive estimation (NCE) with a "context-dependent" noise distribution.
7. APE model

We model the probability of a single event e = (a1, · · · , am) in event space Ω using the following parametric form:

    P_θ(e) = exp(S_θ(e)) / Σ_{e′∈Ω} exp(S_θ(e′))    (1)

where

    S_θ(e) = Σ_{i,j: 1≤i<j≤m} w_ij (v_{a_i} · v_{a_j})    (2)

Here w_ij is the weight for the pairwise interaction between entity types Ai and Aj, constrained to be non-negative, i.e. ∀i, j: w_ij ≥ 0, and v_{a_i} is the embedding vector for entity ai.
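As a sketch, the score in Eq. (2) can be computed directly from the entity embeddings of an event; the embeddings and weights below are random stand-ins, not learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 4, 8                          # entity types, embedding dimension
V = rng.normal(size=(m, d))          # v_{a_i}: one embedding per entity in e
W = np.abs(rng.normal(size=(m, m)))  # w_ij >= 0 (non-negativity constraint)

def score(V, W):
    """S_theta(e) = sum over i<j of w_ij * (v_{a_i} . v_{a_j})."""
    s = 0.0
    for i in range(len(V)):
        for j in range(i + 1, len(V)):
            s += W[i, j] * (V[i] @ V[j])
    return s

s = score(V, W)
```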
9. APE model

The maximum likelihood learning objective:

    argmax_θ Σ_{e∈D} log P_θ(e)    (3)

where we maximize the likelihood of each observed event.

Recall

    P_θ(e) = exp(S_θ(e)) / Σ_{e′∈Ω} exp(S_θ(e′))    (4)

where the event space Ω can be prohibitively large, so we resort to noise-contrastive estimation (NCE).
10. Noise-contrastive Estimation

With NCE, we make two main modifications to the original learning objective:

Treat the normalization term in P_θ(e) as a parameter c:

    P_θ(e) = exp( Σ_{i,j: 1≤i<j≤m} w_ij (v_{a_i} · v_{a_j}) + c )    (5)

Introduce a noise distribution P_n(e) and construct a binary classification problem, discriminating samples from the data distribution against samples from the known artificial noise distribution:

    J(θ) = E_{e∼P_d} [ log ( P_θ(e) / (P_θ(e) + k·P_n(e)) ) ]
         + k·E_{e′∼P_n} [ log ( k·P_n(e′) / (P_θ(e′) + k·P_n(e′)) ) ]    (6)
11. Stochastic gradient descent

In practice, we can use SGD for fast, online training.

For each observed event e, sample k artificial events e′ from P_n(e′), and update parameters according to stochastic gradients based on:

    log σ( log P_θ(e) − log k·P_n(e) ) + Σ_{e′} log σ( −log P_θ(e′) + log k·P_n(e′) )    (7)

Here σ(x) = 1/(1 + exp(−x)) is the sigmoid function.

The complexity of our algorithm is O(N·k·m²·d), where N is the total number of observed events it is trained on, m is the number of entity types, and d is the embedding dimension.
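The per-event objective in Eq. (7), negated to give a loss, can be sketched as follows; in practice an autograd framework would supply the gradient updates:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss(logp_pos, log_kpn_pos, logp_neg, log_kpn_neg):
    """Negated Eq. (7): classify the observed event against k noise samples.

    logp_pos / logp_neg: unnormalized model log-probabilities log P_theta.
    log_kpn_*: the corresponding log k*P_n(.) terms.
    """
    pos_term = np.log(sigmoid(logp_pos - log_kpn_pos))
    neg_term = np.sum(np.log(sigmoid(-(logp_neg - log_kpn_neg))))
    return -(pos_term + neg_term)

# A model that scores the observed event far above the noise samples
# achieves a near-zero loss; the reverse scoring gives a large loss.
good = nce_loss(10.0, 0.0, np.array([-10.0, -10.0]), np.zeros(2))
bad = nce_loss(-10.0, 0.0, np.array([10.0, 10.0]), np.zeros(2))
```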
19. Context-independent noise distribution

"Context-independent" noise distribution: draw an artificial event, independent of the given event e, according to

    P_n^factorized(e) = p_{A_1}(a_1) · · · p_{A_m}(a_m)

Table 2: Generation of an example artificial event in the process-to-process interaction network under the "context-independent" noise distribution.

                    | day | hour | uid | src. proc. | dst. proc. | src. folder | dst. folder
observed event e    | 3   | 16   | 8   | init       | python     | /sbin/      | /usr/bin/
generated event e′  | 5   | 3    | 4   | bash       | mingetty   | /           | /sbin/

+ Simple and tractable.
− Generated samples might be too easy for the classifier.
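The factorized sampling above can be sketched as follows; for illustration, each entity type's marginal p_{A_i} is taken to be the empirical (unigram) distribution over the training events:

```python
import random

def context_independent_sample(train_events):
    """Draw one artificial event: each attribute is sampled independently
    from its entity type's empirical distribution over the training data."""
    m = len(train_events[0])
    return tuple(random.choice([e[i] for e in train_events])
                 for i in range(m))

events = [(3, 16, 8, "init"), (5, 3, 4, "bash")]
noise = context_independent_sample(events)
```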
24. Context-dependent noise distribution

"Context-dependent" noise distribution: first sample an observed event e, then randomly sample an entity type Ai, and finally sample a new entity a′_i ∼ p_{A_i}(a′_i) to replace ai and form a new negative sample e′.

Table 3: Generation of an example artificial event in the process-to-process interaction network according to the "context-dependent" noise distribution.

                    | day | hour | uid | src. proc. | dst. proc. | src. folder | dst. folder
observed event e    | 3   | 16   | 8   | init       | python     | /sbin/      | /usr/bin/
generated event e′  | 3   | 16   | 8   | init       | mingetty   | /sbin/      | /usr/bin/

+ Generated samples are harder for the classifier.
− P_n(e′) ≈ P_d(e) · p_{A_i}(a′_i) / m is intractable.
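The replacement scheme above can be sketched as follows (illustrative; the entity marginals are again taken empirically from the training events):

```python
import random

def context_dependent_sample(event, train_events):
    """Copy an observed event and replace one randomly chosen attribute
    with a value drawn from that entity type's empirical distribution."""
    i = random.randrange(len(event))                           # pick type A_i
    negative = list(event)
    negative[i] = random.choice([e[i] for e in train_events])  # a_i -> a_i'
    return tuple(negative)

event = (3, 16, 8, "init")
neg = context_dependent_sample(event, [(3, 16, 8, "init"), (5, 3, 4, "bash")])
```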
25. Context-dependent noise distribution

We further approximate the "context-dependent" noise term P_n(e′) ≈ P_d(e) · p_{A_i}(a′_i) / m as follows.

Since P_d(e) is small for most events, we simply set it to some constant l, so

    log k·P_n(e′) ≈ log p_{A_i}(a′_i) + z,

where z = log(k·l/m) is a constant term (simply set to 0).

To compute P_n(e) for an observed event e, we use the expectation over all entity types:

    log k·P_n(e) ≈ (1/m) Σ_i log p_{A_i}(a_i) + z.

Again, the constant z is ignored.
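With the constant z dropped, as on this slide, the two approximations amount to:

```python
import numpy as np

def log_kpn_noise(p_replaced):
    """log k*P_n(e') ~ log p_{A_i}(a_i'), with the constant z dropped."""
    return np.log(p_replaced)

def log_kpn_observed(marginals):
    """log k*P_n(e) ~ (1/m) * sum_i log p_{A_i}(a_i), with z dropped."""
    return np.mean(np.log(marginals))
```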
26. Experimental settings

We use two data sets from an enterprise network:

P2P: process-to-process event data set.
P2I: process-to-Internet-socket event data set.

We split the two weeks of data into two one-week periods. Events in the first week are used as the training set, and new events that only appear in the second week are used as the test set.

Generating artificial anomalies: for each observed event in the test set, we generate a corresponding anomaly by replacing 1-3 entities of different types in the event by random sampling.
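The evaluation-set perturbation can be sketched as follows (illustrative; replacement values are drawn from each entity type's observed values in the data):

```python
import random

def make_anomaly(event, train_events, c):
    """Perturb c (1-3 in the paper) attributes of an observed event, each
    replaced by a value sampled from that entity type's observed values."""
    idxs = random.sample(range(len(event)), c)   # c distinct entity types
    anomaly = list(event)
    for i in idxs:
        anomaly[i] = random.choice([e[i] for e in train_events])
    return tuple(anomaly)

a = make_anomaly((3, 16, 8), [(3, 16, 8), (5, 3, 4)], 2)
```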
27. Data statistics

Table 4: Entity types in the data sets.

Data set | Types of entity and their arities
P2P      | day (7), hour (24), uid (361), src proc (778), dst proc (1752), src folder (255), dst folder (415)
P2I      | day (7), hour (24), src ip (59), dst ip (184), dst port (283), proc (91), proc folder (70), uid (162), connection type (3)

Table 5: Statistics of the collected two-week events.

Data | # week 1  | # week 2  | # week 2 new
P2P  | 95,434    | 107,619   | 53,478 (49.69%)
P2I  | 1,316,357 | 1,330,376 | 498,029 (37.44%)
28. Methods for comparison

We compare the following models in the experiments:

Condition: the method proposed in [Das and Schneider 2007].
CompreX: the method proposed in [Akoglu et al. 2012].
APE: the proposed method. Note that we use the negative of its likelihood output as the abnormality score.
APE (no weight): the same as APE, except that instead of learning w_ij we simply set ∀i, j: w_ij = 1; i.e., it is APE without automatic weight learning on pairwise interactions, so all types of interactions are weighted equally.
29. Performance comparison for abnormal event detection

Table 6: Values left of the slash are AUC of ROC; values right of the slash are average precision. Rows marked ∗ are averaged over 3 smaller (1%) test samples due to the long runtime of CompreX.

P2P
Models          | c=1             | c=2             | c=3
Condition       | 0.6296 / 0.6777 | 0.6795 / 0.7321 | 0.7137 / 0.7672
APE (no weight) | 0.8797 / 0.8404 | 0.9377 / 0.9072 | 0.9688 / 0.9449
APE             | 0.8995 / 0.8845 | 0.9540 / 0.9378 | 0.9779 / 0.9639
CompreX∗        | 0.8230 / 0.7683 | 0.8208 / 0.7566 | 0.8390 / 0.7978
APE∗            | 0.9003 / 0.8892 | 0.9589 / 0.9394 | 0.9732 / 0.9616

P2I
Models          | c=1             | c=2             | c=3
Condition       | 0.7733 / 0.7127 | 0.8300 / 0.7688 | 0.8699 / 0.8165
APE (no weight) | 0.8912 / 0.8784 | 0.9412 / 0.9398 | 0.9665 / 0.9671
APE             | 0.9267 / 0.9383 | 0.9669 / 0.9717 | 0.9838 / 0.9861
CompreX∗        | 0.7749 / 0.8391 | 0.7834 / 0.8525 | 0.7832 / 0.8497
APE∗            | 0.9291 / 0.9411 | 0.9656 / 0.9729 | 0.9829 / 0.9854
31. Parameter study

Table 7: Average precision under different choices of noise distribution.

                                        | P2P    | P2I
context-independent                     | 0.8463 | 0.7534
context-dependent, log k·P_n(e) = 0     | 0.8176 | 0.7868
context-dependent, log k·P_n(e) = appx. | 0.8845 | 0.9383

[Figure 3: Average precision versus number of negative samples drawn per entity type (1-5), for P2P and P2I.]
32. Case study

Here are some examples of detected anomalies (events with low probabilities):

Table 8: Examples of detected abnormal events.

Data | Abnormal event                               | Ab. entity
P2P  | ..., src proc: bash, src folder: /home/, ... | src proc
P2P  | ..., uid: 9 (some main user), hour: 1, ...   | uid
P2I  | ..., proc: ssh, dst port: 80, ...            | dst port
34. Embedding Visualization

Figure 5: 2-D visualization of user embeddings. Each color indicates a user type in the system. The left-most and right-most points are the Linux and Windows root users, respectively.