3. Introduction
● Many applications require being able to decide whether a new observation
belongs to the same distribution as existing observations (it is an inlier), or
should be considered as different (it is an outlier).
● Often, this ability is used to clean real data sets.
● Inliers are labeled 1, while outliers are labeled -1.
4. Novelty Detection
● Consider a data set of n observations from the same distribution
described by p features.
● Consider now that we add one more observation to that data set.
● Is the new observation so different from the others that we can doubt it is
regular?
● It is about to learn a rough, close frontier delimiting the contour of the
initial observations distribution, plotted in embedding p-dimensional
space.
6. Novelty Detection using OneClassSVM
● Training data is not polluted.
● One-class SVM is an unsupervised
algorithm that learns a decision
function for novelty detection:
classifying new data as similar or
different to the training set.
7. Outlier Detection
● Separate regular observation from the polluting ones.
● Three ways of doing outlier detection
Elliptic Envelope IsolationForest Local Outlier Factor
8. Elliptical Envelop
● One common way of performing outlier
detection is to assume that the regular
data come from a known distribution
(e.g. data are Gaussian distributed).
● It tries to define the “shape” of the data,
and can define outlying observations as
observations which stand far enough
from the fit shape.
9. Isolation Forest
● It’s an efficient way of performing
outlier detection in high-dimensional
datasets is to use random forests.
● Built on the basis of decision trees
● Outliers lie further away from regular
observation.
● Random partitioning produces
noticeably shorter paths for
anomalies.
10. Local Outlier Factor
● It measures the local density
deviation of a given data point with
respect to its neighbors.
● The idea is to detect the samples
that have a substantially lower
density than their neighbors.
11. Handling Outliers
● Manual Analysis
● Dropping them
● Generating alerts
● Creating new feature marking outliers
12. Clustering Method - DBSCAN
● A density based clustering method
● N is an outlier point that lies in no
cluster and it is not ‘density
reachable’ nor ‘density connected’
to any other point. Thus this point
will have “his own cluster”.
14. Visit : www.zekeLabs.com for more details
THANK YOU
Let us know how can we help your organization to Upskill the
employees to stay updated in the ever-evolving IT Industry.
Get in touch:
www.zekeLabs.com | +91-8095465880 | info@zekeLabs.com