Local Outlier Factor
Lab Report: Lab Development and Application of Data Mining and
Learning Systems 2015
Amr Koura
Abstract: Outlier detection has become an important problem in many
real-world applications. In some applications, like intrusion detection, finding
outliers is more important than finding common patterns. In this paper,
we discuss one of the outlier detection algorithms, the "LOF: Local
Outlier Factor" algorithm. We present the algorithm in two modes: the first is
"batch mode", where the input data set is known in advance, while the second
is "incremental mode", where outliers must be detected on the fly while
streaming data is received. The paper also presents the implementation details
for the two modes and the integration with the open-source data mining project
"realKD". In the first part, on the LOF batch mode, the paper provides a
theoretical explanation of the algorithm and discusses its implementation
details, while in the second part, on the incremental mode, the
paper shows how the algorithm computes outliers efficiently: the
insertion and deletion of points affect only a limited number of nearest
neighbors and do not depend on the total number of points N in the data
set. The implementation details and the integration with the realKD library are
also discussed.
1 Introduction
Knowledge discovery in databases (KDD) focuses on identifying understandable
knowledge from existing data. Most KDD algorithms concentrate on computing
patterns that match a large portion of the objects in a data set. However, in applications
like intrusion detection, detecting rare events that deviate from the majority is more
important than identifying common patterns.
Most outlier detection algorithms rely on clustering algorithms. For clustering
algorithms, outliers are points that reside outside the clusters and are considered noise. This
approach depends heavily on the particular clustering algorithm and its parameters. In fact, there
are very few algorithms that are directly concerned with outlier detection.
Most outlier detection algorithms treat outlierness as a binary property, so
points are classified either as outliers or not. The Local Outlier Factor is an algorithm
that quantifies the tendency of being an outlier: it computes a local outlier factor
for each example that expresses the tendency of that point to be an outlier.
The algorithm builds on density-based clustering: it computes the density of
each point and compares it to the density of its neighbors. The outlier factor is
local in the sense that only a restricted neighborhood of each point is taken into account.
Online detection of outliers plays an important role in many streaming applications.
Automated identification of outliers in data streams is a hot research topic with
many uses in modern applications such as security, image, and multimedia
analysis. The incremental mode of the LOF algorithm can detect outliers efficiently. The
incremental LOF provides results equivalent to batch mode LOF, and has O(N log N)
time complexity, where N is the total number of points in the data set. The paper
runs experiments on the insertion and deletion of points using incremental LOF and
compares the results with those obtained from the batch mode test.
The paper also discusses the implementation details for the batch and incremental modes
and shows how the code is integrated with the open-source project "realKD".
2 Related Work
Most previous KDD research papers on outlier detection built their
approaches on clustering algorithms. Those algorithms were optimized for clustering
and treated outliers as noise. There was a need for algorithms
designed solely to identify outliers.
The paper "LOF: Identifying Density-based Local Outliers" [1] was one of the very
first to study the LOF algorithm. In that paper, the authors explain the theory
behind the LOF algorithm and derive the equations it is based on.
The paper illustrates the problem with earlier distance-based algorithms using Figure
1. In this figure, the data set contains a dense cluster C2, a less dense cluster
C1, and two objects O1 and O2. According to the paper, if the distance between each
object in C1 and its nearest neighbors is larger than the distance between O2 and C2,
then there is no minimum distance dmin that classifies O2 as an outlier without
also classifying the objects in C1 as outliers. LOF solves this problem because it takes a
local view of the points rather than the global view of distance-based methods. Our
implementation of the batch mode is based entirely on the equations provided in [1].
Outlier detection in data stream applications has become an increasingly
important and active research area. The incremental mode of the LOF algorithm in this
paper is based on [2]. In that paper, the author presents an efficient algorithm for detecting
Figure 1: Distance-based outlier detection example
(source: http://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf)
outliers in data stream applications, and shows that the insertion or deletion of a point
depends only on a limited number of neighbors, not on the total number of points N
in the data set. In this paper we use the same algorithms and try them on a real example.
Our implementation of the incremental mode of LOF is based on the algorithms
provided in [2].
Our implementation is integrated with the open-source project "realKD".
realKD is an open-source Java library designed to help users apply KDD
algorithms to discover real knowledge from real data.
The repository for the realKD library is https://bitbucket.org/realKD/realkd/wiki/Home.
The realKD library is published under the MIT license, and many algorithms are
already implemented in it. An outlier detection algorithm based on support vector
machines (SVM) was implemented previously, and our algorithm is intended to be the second
outlier detection algorithm in the library. In the next section we discuss the
interfaces we implement in order to integrate successfully, and we also describe
how the user can call our algorithm with parameters from the command line interface.
3 Local Outlier Factor Algorithm
The Local Outlier Factor algorithm is based on the concept of local density: the
algorithm compares the density of each point with the density of its nearest neighbors. In
the following subsections, the paper discusses the theory behind the algorithm, the
implementation details, and the integration. The first subsection
deals with the "batch" mode, while the second deals with the
"incremental" mode.
3.1 LOF Batch Mode
In this section, we discuss the theory behind the LOF batch mode as well as the
implementation and integration details. The details of the algorithm
are based on [1].
3.1.1 Formal Definitions:
Let k-distance(p) be the distance of the object p to its k-th nearest neighbor. The set of
k nearest neighbors includes all objects at up to this distance, and can therefore contain more than k objects.
We denote the set of k nearest neighbors by N_k(p):
N_k(p) = { q ∈ D \ {p} | d(p, q) ≤ k-distance(p) }
Definition (reachability distance of an object p w.r.t. object o):
Let k be an integer. The reachability distance of an object p with respect to
an object o is defined as:
reach-dist_k(p, o) = max{ k-distance(o), d(p, o) }
Figure 2: reach-dist(p1, o) and reach-dist(p2, o), for k = 4
(source: http://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf)
Figure 2 illustrates the idea of the reachability distance between objects p and o. If object
p is far away from object o, like p2, then the reachability distance is simply the
distance between p and o, d(p, o). But if object p is close to o, like p1, then the
reachability distance equals the k-distance of object o. This shows the importance
of the parameter k: the higher the value of k, the more similar the reachability distances
for objects within the same neighborhood.
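The definition above can be sketched directly in code (a hedged illustration with a toy data set, not the realKD Java implementation):

```python
# Sketch of reach-dist_k(p, o): the distance from p to o, but never smaller
# than the k-distance of o. Toy data and helper names are our own.
from math import dist

def k_distance(o, data, k):
    """Distance from o to its k-th nearest neighbor (o itself excluded)."""
    return sorted(dist(o, q) for q in data if q != o)[k - 1]

def reach_dist(p, o, data, k):
    """max{k-distance(o), d(p, o)} as in the definition above."""
    return max(k_distance(o, data, k), dist(p, o))

data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
# A point close to o = (0, 0) is "pushed out" to o's k-distance ...
print(reach_dist((0.2, 0.0), (0.0, 0.0), data, 3))  # 1.0
# ... while a far point keeps its actual distance.
print(reach_dist((5.0, 0.0), (0.0, 0.0), data, 3))  # 5.0
```

This smoothing is exactly why larger k makes reachability distances within a neighborhood more uniform.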
Definition (local reachability density of an object p):
lrd_k(p) = 1 / ( ( Σ_{o ∈ N_k(p)} reach-dist_k(p, o) ) / |N_k(p)| )
The local reachability density of an object p is the inverse of the average reachability
distance of p to its k nearest neighbors.
Definition (local outlier factor of an object p):
The local outlier factor of an object p is defined as:
LOF_k(p) = ( Σ_{o ∈ N_k(p)} lrd_k(o) / lrd_k(p) ) / |N_k(p)|
The LOF value of an object expresses its tendency to be an outlier. The LOF
value has a very useful property for detecting outliers: the LOF value of objects
deep inside a cluster is approximately equal to 1, while objects outside any cluster
have LOF values larger than 1 (see the proof in [1]).
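Putting the three definitions together, the whole batch computation fits in a short sketch (a naive O(N^2) illustration on made-up toy data, not the realKD "LOFOutlier" class):

```python
# Naive batch LOF following the definitions above; every helper recomputes
# from scratch, so this is for illustration only (toy data, Euclidean distance).
from math import dist

def k_distance(p, data, k):
    return sorted(dist(p, q) for q in data if q != p)[k - 1]

def n_k(p, data, k):
    """N_k(p): all points (other than p) within k-distance(p)."""
    kd = k_distance(p, data, k)
    return [q for q in data if q != p and dist(p, q) <= kd]

def reach_dist(p, o, data, k):
    return max(k_distance(o, data, k), dist(p, o))

def lrd(p, data, k):
    nk = n_k(p, data, k)
    return len(nk) / sum(reach_dist(p, o, data, k) for o in nk)

def lof(p, data, k):
    nk = n_k(p, data, k)
    return sum(lrd(o, data, k) for o in nk) / (len(nk) * lrd(p, data, k))

# A tight cluster plus one far-away point: cluster members score near 1,
# the isolated point scores well above 1.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
for p in data:
    print(p, round(lof(p, data, 3), 3))
```

On this data set with k = 3, the five cluster points get LOF values close to 1, while (10, 10) gets a LOF value an order of magnitude larger, matching the property stated above.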
3.1.2 Implementation Details
In this subsection, we present our implementation details for the batch mode LOF
algorithm.
The class "LOFOutlier" in the package "de.unibonn.realkd.algorithms.outlier.LOF"
contains the code that implements the algorithm's equations. The class extends the
abstract class "AbstractMiningAlgorithm", which contains the logic required to
be callable from the realKD framework. We maintain an N×N matrix called "trainingMatrix"
that contains the pairwise distances between all points in our data set. Another N×N matrix called
"sortedTrainingMatrix" contains, for each point, the indices of all points sorted by their
distance to it. For example, if the nearest neighbor of the point with index 0 is the point with index
2, followed by the point with index 6, then the first row of this matrix looks like this:
0 2 6 ...
So for the first point, the nearest neighbor is itself (index 0), then the point at index 2,
then the point at index 6, and so on. The reason for selecting this data structure is
speed: finding the k nearest neighbors, or the reverse k nearest
neighbors (which we will need later in the incremental mode), is fast
with these matrices.
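The two matrices can be sketched as follows (toy coordinates; the variable names mirror the report, but this is not the Java code):

```python
# training_matrix: pairwise distances; sorted_training_matrix: per row, the
# point indices ordered by distance (each point's nearest "neighbor" is itself).
from math import dist

points = [(0.0, 0.0), (5.0, 0.0), (1.0, 0.0)]
n = len(points)

training_matrix = [[dist(p, q) for q in points] for p in points]
sorted_training_matrix = [
    sorted(range(n), key=lambda j, i=i: training_matrix[i][j]) for i in range(n)
]

print(sorted_training_matrix[0])  # [0, 2, 1]: itself, then index 2, then index 1
```

With the rows presorted, the k nearest neighbors of point i are simply the first k entries after index i itself, and the reverse k nearest neighbors can be found by scanning a column of the sorted rows.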
The main inputs of the algorithm are the data set and the value of the
parameter k. The class "FractionOFLOFParameter" is the placeholder for the
k value. The main logic lives in the function "concreteCall", which is called by
realKD once the user specifies the algorithm name "LOF" and passes the k value.
This function calls the "computeLofValues" function, which takes the input data
set and the k value and calls the corresponding functions to compute the
LRD and LOF for each example in the data set. To test this class, the user should call the
program with the following parameters:
RealKD load "Path to input data set" "Path to input attributes file" "Path to input groups file" run LOF "Numeric target attributes" KValue=[value of k]
For example:
RealKD load "/Users/XYZ/simpleTestFile/data.txt" "/Users/XYZ/simpleTestFile/attributes.txt"
"/Users/XYZ/simpleTestFile/groups.txt" run LOF "Numeric Target attributes=Latitude,Longitude"
KValue=3
3.2 Incremental LOF Mode
Designing an incremental LOF algorithm is motivated by two goals. First, the
results of the algorithm should be equivalent to those of the iterated "static"
LOF algorithm. Second, because stream data is considered infinite, we need
an efficient algorithm that can perform insertion and deletion without depending on the
total number of points N; otherwise the overall cost would be O(N^2 log N). The paper
[2] presents an efficient incremental LOF algorithm for insertion and deletion:
each insertion or deletion updates only a limited number of neighbors,
so it does not depend on the total number of records N. This improves the complexity
compared to iterated static LOF, from O(N^2 log N) to O(N log N).
In this paper we show the details of the insertion and deletion parts; our
implementation is based on the algorithms in [2].
3.2.1 Insertion
In the insertion part, the algorithm must keep track of the points whose
k-distance, LRD, and LOF values need to be updated after inserting the new point.
Figure 3 shows the general framework for inserting a new point in incremental LOF.
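The key point is that only the reverse k nearest neighbors of the new point, i.e. the points that gain the new point as one of their k nearest neighbors, need any update. A minimal sketch of that affected set (toy data and helper names are our own, not from [2] or realKD):

```python
# Only points q with d(q, p_new) <= k-distance(q) (computed on the old data)
# acquire p_new as a k-nearest neighbor, so only they need k-distance/lrd/LOF
# updates; the rest of the data set is untouched.
from math import dist

def k_distance(p, data, k):
    return sorted(dist(p, q) for q in data if q != p)[k - 1]

def affected_by_insert(p_new, data, k):
    return [q for q in data if dist(q, p_new) <= k_distance(q, data, k)]

data = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(affected_by_insert((0.5, 0.5), data, 2))   # all four cluster points
print(affected_by_insert((100, 100), data, 2))   # [] -- a far point disturbs nobody
```

Because the affected set depends only on the local neighborhood of the new point, the update cost is independent of N, which is what yields the O(N log N) total complexity.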
3.2.2 Deletion
In data stream applications, there is a need to delete one or more examples,
sometimes because of memory limitations and sometimes because the examples have become
outdated.
As in the insertion part, the deletion part must keep track of the affected examples and
update their k-distance, LRD, and LOF values after deleting the required example.
Figure 4 shows the framework for deletion.
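Deletion mirrors insertion: only the points that counted the deleted example among their k nearest neighbors see their k-distance grow. A small sketch of that locality (toy two-cluster data, our own helper name):

```python
# After deleting p_del, only its reverse k-nearest neighbors need updating:
# their k-distance grows, while every other point's neighborhood is unchanged.
from math import dist

def k_distance(p, data, k):
    return sorted(dist(p, q) for q in data if q != p)[k - 1]

left = [(0, 0), (0, 1), (1, 0), (1, 1)]          # cluster containing p_del
right = [(10, 0), (10, 1), (11, 0), (11, 1)]     # an unrelated cluster
data = left + right
p_del = (0, 0)
after = [q for q in data if q != p_del]

# A neighbor of the deleted point must be updated ...
assert k_distance((1, 0), after, 2) > k_distance((1, 0), data, 2)
# ... but the far cluster is completely unaffected.
assert k_distance((10, 0), after, 2) == k_distance((10, 0), data, 2)
```

Again the number of updated points is bounded by the size of the local neighborhood, not by N.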
3.2.3 Implementation Details:
In this section we present the implementation details, the integration, and the command line
interface for calling incremental LOF for both insertion and deletion.
Analogously to batch LOF, we created two classes for incremental LOF: first the
"ILOFOutlierAdd" class, which contains the logic of the insertion algorithm, and second
"ILOFOutlierDelete", which contains the logic of the deletion algorithm. For reusability, both
classes extend the "LOFOutlier" class, because we need to reuse all the functions that
Figure 3: Incremental LOF insertion
(source: http://www-ai.cs.uni-dortmund.de/LEHRE/FACHPROJEKT/SS12/paper/outlier/pokrajac2007.pdf)
compute the LRD and LOF values, and we also need to use trainingMatrix and sortedTrainingMatrix.
3.2.3.1 Insertion: In the case of insertion, the algorithm needs the k value (already
implemented in batch mode, so we reuse it) and the new data point to be inserted.
The class that corresponds to the new parameter is
"ILOFNewDataParameter".
Again, the function "concreteCall" contains the main logic of the insertion
algorithm. In this function, the code inserts the new example into the data set and updates
the LOF values of only the limited number of affected neighbors, as shown in the
algorithm.
To test incremental LOF addition, the user should call the program with the following
parameters:
RealKD load "Path to input data set" "Path to input attributes file" "Path to input groups file" run LOF "Numeric target attributes" KValue=[value of k] newPoint="delimited string of the new example"

Figure 4: Incremental LOF deletion
(source: http://www-ai.cs.uni-dortmund.de/LEHRE/FACHPROJEKT/SS12/paper/outlier/pokrajac2007.pdf)
For example, to insert a data sample with attributes "Alexandria", "31.205753", "29.924526",
the user executes:
RealKD load "/Users/XYZ/simpleTestFile/data.txt" "/Users/XYZ/simpleTestFile/attributes.txt"
"/Users/XYZ/simpleTestFile/groups.txt" run LOF "Numeric Target attributes=Latitude,Longitude"
KValue=3 newPoint="Alexandria;31.205753;29.924526"
3.2.3.2 Deletion: In the case of deletion, the algorithm needs the k value (already
implemented in batch mode, so we reuse it) and the index of the data point to be
deleted. The class that corresponds to the new parameter is
"ILOFDeleteExample".
Again, the function "concreteCall" contains the main logic of the deletion
algorithm. In this function, the code deletes the example at the index position that
the user passes and updates the LOF values of only the limited number of affected
neighbors, as shown in the algorithm.
To test incremental LOF deletion, the user should call the program with the following
parameters:
RealKD load "Path to input data set" "Path to input attributes file" "Path to input groups file" run LOF "Numeric target attributes" KValue=[value of k] deleteIndex=[index to be deleted]
For example, to delete the fourth record in the data set (index = 3), the user executes:
RealKD load "/Users/XYZ/simpleTestFile/data.txt" "/Users/XYZ/simpleTestFile/attributes.txt"
"/Users/XYZ/simpleTestFile/groups.txt" run LOF "Numeric Target attributes=Latitude,Longitude"
KValue=3 deleteIndex=3
4 Experiments
In this section, we show an experiment running the algorithm on a simple
geographical data set that contains German and Egyptian cities with their coordinates
(latitude and longitude). For simplicity, we select a small value of k = 3.
4.1 Running Batch Mode
We put nine German cities and one Egyptian city with their coordinates
(latitude and longitude) into the data set and ran the algorithm with k = 3.
Here is the input data set; each record contains the city name, latitude, and longitude:
Berlin 52.520 13.380
Hamburg 53.550 10.000
Munchen 48.140 11.580
Koln 50.950 6.970
Frankfurt 50.120 8.680
Dortmund 51.510 7.480
Stuttgart 48.790 9.190
Essen 51.470 7.000
Bonn 50.730 7.100
Cairo 30.3 31.14
After running the program, the algorithm computes the LOF value for all cities, and
we can see that the Egyptian city has a large LOF value. The following is the program
output, which shows the index of each city along with its LOF value:
1 1.191001325549464
2 1.1997290645736223
3 0.9628552264586343
4 0.7586643428289646
5 0.7359971660104047
6 0.7495005015494334
7 1.005038367007253
8 0.6938237614108347
9 0.8042180889618675
10 5.521620958353801
From the program output, we see that the object outside the cluster (Cairo) has a
LOF value much larger than 1, while all other objects have LOF values approximately equal to 1.
Now let us add three more Egyptian cities:
Aswan 25.6833 32.6500
Alexandria 31.13 29.58
Hurghada 27.15 33.50
Now the number of Egyptian cities is 4, and they form their own cluster, since their
number is larger than the value of k. So when the algorithm asks for the nearest
3 neighbors of an Egyptian city, the list contains 3 cities from the same cluster, and
the LOF values are therefore approximately equal to 1. When we run the program, we get the
following output:
1 1.187791873035322
2 1.1898577015355898
3 0.968974318939152
4 0.7563190608212818
5 0.7411657187228445
6 0.7532750824303704
7 0.9872825814658073
8 0.6951376411639997
9 0.8009756608212486
10 0.77008269448957
11 0.7192686329698315
12 0.7903058153557572
13 0.7239698839427493
Now we can see that all points have LOF values approximately equal to 1, which
matches our expectation, since all examples are now within clusters.
4.2 Running Incremental Mode
In this section we run the incremental LOF insertion, compute the LOF values
afterwards, and compare the result with the result obtained from running the batch
mode in the previous subsection. We start with the same nine German cities, plus three
Egyptian cities:
Cairo 30.3 31.14
Aswan 25.6833 32.6500
Alexandria 31.13 29.58
Then, using incremental LOF addition, we add a fourth Egyptian city,
Hurghada;27.15;33.50, and compute the LOF values for all cities. The user calls the
program from the command line with the following parameters:
RealKD load "/Users/XYZ/simpleTestFile/data.txt" "/Users/XYZ/simpleTestFile/attributes.txt"
"/Users/XYZ/simpleTestFile/groups.txt" run LOF "Numeric Target attributes=Latitude,Longitude"
KValue=3 newPoint="Hurghada;27.15;33.50"
Then, this is the output:
1 1.187791873035322
2 1.1898577015355898
3 0.968974318939152
4 0.7563190608212818
5 0.7411657187228445
6 0.7532750824303704
7 0.9872825814658073
8 0.6951376411639997
9 0.8009756608212486
10 0.77008269448957
11 0.7192686329698315
12 0.7903058153557572
13 0.7239698839427493
So we get the same result from running LOF in batch mode and in incremental mode.
5 Conclusion
In this paper, we have discussed two modes of running the LOF algorithm: batch mode
and incremental mode. By running an experiment on a simple 2-d geographical data set, we
obtained the results expected from our theoretical understanding.
The algorithm computes a LOF value for each existing data example, and this value
expresses the tendency of the data point to be an outlier. Data examples that lie
inside a cluster have LOF values approximately equal to 1, while examples outside the clusters
have LOF values larger than 1.
The incremental LOF algorithm produces the same results as the iterated static
algorithm but has better complexity, since it updates only a limited number of neighbors and
does not depend on the total number of examples in the data set.
Our implementation of the batch mode has been integrated into the realKD development
branch, while the incremental mode has been tested on a local machine and will be
merged into the realKD development branch soon.
More research is required to find optimized ways to compute the k nearest neighbors and
reverse k nearest neighbors in order to improve performance. More research is also
required on selecting a suitable k value and on determining the LOF threshold for identifying
outliers when applying the algorithm to real-world data sets.
References
[1] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. LOF:
Identifying density-based local outliers. SIGMOD Rec., 29(2):93–104, May 2000.
[2] Dragoljub Pokrajac. Incremental local outlier detection for data streams. In
Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining,
pages 504–515, 2007.