FDA_SAKEC2018.pptx
1. Statistics: Unlocking the Power of Data Lock5
Financial DATA ANALYTICS
Dr. M.Vijayalakshmi, VESIT
4th Jan 2018, SAKEC Mumbai
2. Statistics: Unlocking the Power of Data Lock5
Financial Data
The financial industry has always been driven by data.
Today, Big Data is prevalent at various levels of this field, ranging from
the financial services sector to capital markets.
The availability of Big Data in this domain has opened up new avenues
for innovation and has offered immense opportunities for growth and
sustainability.
At the same time, it has presented several new challenges that must be
overcome to gain the maximum value out of it.
4. Statistics: Unlocking the Power of Data Lock5
Motivation
There has been an explosion in the velocity, variety and volume of financial
data. Social media activity, mobile interactions, server logs, real-time market
feeds, customer service records, transaction details, information from existing
databases – there’s no end to the flood.
To make sense of these giant data sets, companies are increasingly turning to
data scientists for answers. These numbers gurus are:
Capturing and analyzing new sources of data, building predictive models and running
live simulations of market events
Using technologies such as Hadoop, NoSQL and Storm to tap into non-traditional data
sets (e.g., geolocation, sentiment data) and integrate them with more traditional
numbers (e.g., trade data)
Finding and storing increasingly diverse data in its raw form for future analysis
They’ve been aided in this quest by the development of cloud-based data
storage and the surge of sophisticated (and sometimes free or open-source)
analytics tools.
5. Statistics: Unlocking the Power of Data Lock5
Important Applications of Financial
Data Analytics
1. Predictive Analytics / Trading
2. Sentiment Analysis
3. Financial Fraud
4. Credit Scoring Ratings
5. Pricing
6. Customer Segmentation
7. Know Your Customer
6. Statistics: Unlocking the Power of Data Lock5
Sentiment Analysis
Sentiment analysis (aka opinion mining) applies natural-language
processing, text analysis and computational linguistics to source material
to discover what folks really think.
Several large businesses like MarketPsy Capital, Think Big Analytics and
MarketPsych Data are using it to:
Build algorithms around market sentiment data (e.g., Twitter feeds) that
can short the market when disasters (e.g., storms, terrorist attacks) occur
Track trends, monitor the launch of new products, respond to issues and
improve overall brand perception
Analyze unstructured voice recordings from call centers and recommend
ways to reduce customer churn, up-sell and cross-sell products and detect
fraud
Some data companies are even acting as intermediaries, collecting and
selling sentiment indicators to retail investors.
7. Statistics: Unlocking the Power of Data Lock5
Automated Risk Credit Management
Internet finance companies are finding ways to approve loans and manage risk.
Aliloan (from AliBaba) is an automated online system that provides flexible
micro-loans to entrepreneurial online vendors.
To gauge whether a vendor is creditworthy, Alibaba collects data from its e-commerce
and payment platforms and analyzes transaction records, customer
ratings, shipping records and a host of other info.
These findings are confirmed by third-party verification and cross-checked
against external data sets (e.g., customs, tax data, electricity records, etc.).
Once the loan is granted, Alibaba continues to monitor the use of funds and
assess the business’s strategic development.
Entrepreneurs in emerging markets are also reaping the benefits. Like Aliloan,
companies such as Kreditech and Lenddo provide automated small loans based
on innovative credit scoring techniques. In these cases, much of the score is
calculated from applicants’ online social networking data.
8. Statistics: Unlocking the Power of Data Lock5
Real Time Analytics
In days of yore, financial institutions were hampered by the lag-time between data
collection and data analysis. Real-time analytics short-circuits this problem and provides
the industry with new ways to:
Fight Financial Fraud: Banks and credit card companies routinely analyze account
balances, spending patterns, credit history, employment details, location and a load of
other data points to determine whether transactions are above board. If suspicious
activity is detected, they can immediately suspend the account and alert the owner.
Improve Credit Ratings: A continuous feed of online data means credit ratings can
be updated in real time. This provides lenders with a more accurate picture of a
customer’s assets, business operations and transaction history.
Provide More Accurate Pricing: Progressive Insurance already tailors its policies to
account for a customer’s changing financial situation. In the Internet of Things, data
from automobile sensors will also help insurance companies issue their policyholders
warnings about accidents, traffic jams and weather conditions. That makes for
safer drivers and fewer payouts.
9. Statistics: Unlocking the Power of Data Lock5
Customer Segmentation
Like every other industry on the planet, banks and financial
institutions are hungry to know more about the people using their
products and services. And though they already store a ton of data
– from credit scores to day-to-day transactions – they’re not too
proud to look for it elsewhere.
This kind of customer segmentation allows them to:
Offer customized product offerings and services
Improve existing profitable relationships and avoid customer churn
Create better marketing campaigns and more attractive product offerings
Tailor product development to specific customer segments
10. Statistics: Unlocking the Power of Data Lock5
Predictive Analytics
By combining segmentation with predictive analytics, companies can also cut down on
risk. For example, to decide whether certain customers are likely to pay off their credit
cards, some major banks use technology developed by the company Sqrrl. This analysis
takes into account the demographic characteristics of customers’ neighborhoods and
makes calculated predictions.
Similar strides have been made in forecasting market behavior. Once upon a time (e.g.,
2009), high-frequency trading – the speedy exchange of securities – was hugely
lucrative. With competition came a drop in profits and the need for a new strategy.
HFT traders adapted by employing strategic sequential trading, using big data analytics
to identify specific market participants and anticipate their future actions. In a field of
breakneck speed, this gives HFT traders an unmistakable advantage.
Separately, researchers studying search volume data provided by Google Trends were able to identify
online precursors for stock market moves. Their results suggest that increases in search
volume for financially relevant search terms usually precede big losses in financial
markets.
11. Statistics: Unlocking the Power of Data Lock5
Analytics of Financial Time Series
A vast majority of financial data occurs in the form of a time series
Stock prices (ticker data)
Asset prices
Customer numbers
Etc.
So Financial Data Analytics places a lot of importance on financial time
series analytics
12. Statistics: Unlocking the Power of Data Lock5
Examples of financial time series
Daily log returns of Apple stock: 2007 to 2016 (10 years)
BSE index
Quarterly earnings of Coca-Cola Company: 1983-2009
Seasonal time series are useful in
earnings forecasts
pricing weather-related derivatives (e.g., energy)
modeling intraday behavior of asset returns
Exchange rate between the US Dollar and the Indian Rupee
Size of insurance claims Values
High-frequency financial data: Tick-by-tick data of stock, etc
13. 13
Mining Time-Series Data
A time series is a sequence of data points, measured typically at
successive times, spaced at (often uniform) time intervals
Time series analysis: A subfield of statistics, comprises methods that
attempt to understand such time series, often either to understand the
underlying context of the data points or to make forecasts (or
predictions)
Methods for time series analyses
Frequency-domain methods: Model-free analyses, well-suited to
exploratory investigations
spectral analysis vs. wavelet analysis
Time-domain methods: Auto-correlation and cross-correlation
analysis
Motif-based time-series analysis
Applications
Financial: stock price, inflation
Industry: power consumption
Scientific: experiment results
Meteorological: precipitation
14. Statistics: Unlocking the Power of Data Lock5 14
Time-Series Data Analysis: Prediction &
Regression Analysis
(Numerical) prediction is similar to classification
construct a model
use model to predict continuous or ordered value for a given input
Prediction is different from classification
Classification predicts categorical class labels
Prediction models continuous-valued functions
Major method for prediction: regression
model the relationship between one or more independent or
predictor variables and a dependent or response variable
Regression analysis
Linear and multiple regression
Non-linear regression
Other regression methods: generalized linear model, Poisson
regression, log-linear models, regression trees
15. Statistics: Unlocking the Power of Data Lock5 15
What is Regression?
Modeling the relationship between one response variable and one or
more predictor variables
Analyzing the confidence of the model
E.g., height vs. weight
16. Statistics: Unlocking the Power of Data Lock5 16
Regression Yields Analytical Model
Discrete data points →Analytical model
General relationship
Easy calculation
Further analysis
Application - Prediction
17. Statistics: Unlocking the Power of Data Lock5 17
Application - Detrending
Obtain the trend for irregular data series
Subtract trend
Reveal oscillations
trend
18. Statistics: Unlocking the Power of Data Lock5 18
Linear Regression - Single Predictor
Model is linear
y = w0 + w1 x
where w0 (y-intercept) and w1
(slope) are regression coefficients
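Method of least squares:
$w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}$
$w_0 = \bar{y} - w_1 \bar{x}$
[Figure: fitted line with intercept w0 and slope w1; x: predictor variable, y: response variable]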
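The least-squares estimates for the single-predictor model can be sketched in a few lines of Python. This is only an illustration: the function name `fit_line` and the sample data are assumptions, not part of the original slides.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w0 + w1 * x for paired samples."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope estimate w1.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = y_bar - w1 * x_bar  # the fitted line passes through (x_bar, y_bar)
    return w0, w1


# Illustrative data lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w0, w1 = fit_line(xs, ys)
```

Because the sample points lie exactly on a line, the fit recovers its intercept and slope; with noisy data the same code returns the line minimizing the sum of squared residuals.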
19. Statistics: Unlocking the Power of Data Lock5 19
Linear Regression – Multiple Predictors
Training data is of the form (X1, y1), (X2, y2), …, (X|D|, y|D|)
E.g., for 2-D data
y = w0 + w1 x1 + w2 x2
Solvable by
Extension of the least-squares method
$(X^T X) W = X^T Y \rightarrow W = (X^T X)^{-1} X^T Y$
Commercial software (SAS, S-Plus)
[Figure: regression plane over predictors x1 and x2 with response y]
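A minimal sketch of the multiple-predictor case, assuming the standard normal equations (XᵀX) W = Xᵀ Y and solving them with a small hand-written Gauss-Jordan routine. The helper names `solve` and `fit_multiple` and the data are hypothetical; in practice a numerical library would normally be used instead.

```python
def solve(a, b):
    """Solve the square system a * w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_multiple(rows, ys):
    """Least-squares weights [w0, w1, ..., wk] for y = w0 + w1*x1 + ... + wk*xk."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(X, ys)) for i in range(k)]
    return solve(xtx, xty)


# Illustrative 2-D data generated from y = 1 + 2*x1 + 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
weights = fit_multiple(rows, ys)
```

Solving the linear system directly avoids forming an explicit matrix inverse, which is the usual choice for numerical stability.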
20. Statistics: Unlocking the Power of Data Lock5 20
Nonlinear Regression with Linear Method
Polynomial regression model
E.g., $y = w_0 + w_1 x + w_2 x^2 + w_3 x^3$
Let $x_2 = x^2$, $x_3 = x^3$
$y = w_0 + w_1 x + w_2 x_2 + w_3 x_3$
Log-linear regression model
E.g., $y = \exp(w_0 + w_1 x + w_2 x_2 + w_3 x_3)$
Let $y' = \log(y)$
$y' = w_0 + w_1 x + w_2 x_2 + w_3 x_3$
21. Statistics: Unlocking the Power of Data Lock5 21
Generalized Linear Regression
Response y
Distribution function in the exponential family
Variance of y depends on E( y), not a constant
$E(y) = g^{-1}(w_0 + w_1 x + w_2 x_2 + w_3 x_3)$
Examples
Logistic regression (binomial regression): probability of some
event occurring
Poisson regression: number of customers
…
References: Nelder and Wedderburn, 1972; McCullagh and
Nelder, 1989
22. 22
Regression Tree (Breiman et al., 1984)
Partition the domain space
Leaf: (1) a continuous-valued
prediction; (2) average value
23. Statistics: Unlocking the Power of Data Lock5 23
Model Tree
Leaf – a linear equation
More general than regression tree
Figure source: http://datamining.ihe.nl/research/model-trees.htm
24. Statistics: Unlocking the Power of Data Lock5 24
Regression Trees and Model Trees
Regression tree: proposed in CART system (Breiman et al. 1984)
CART: Classification And Regression Trees
Each leaf stores a continuous-valued prediction
It is the average value of the predicted attribute for the training tuples that
reach the leaf
Model tree: proposed by Quinlan (1992)
Each leaf holds a regression model—a multivariate linear equation for the
predicted attribute
A more general case than regression tree
Regression and model trees tend to be more accurate than linear
regression when the data cannot be represented well by a simple
linear model
25. Statistics: Unlocking the Power of Data Lock5 25
A time series can be illustrated as a time-series graph
which describes a point moving with the passage of time
26. Statistics: Unlocking the Power of Data Lock5 26
Categories of Time-Series Movements
Long-term or trend movements (trend curve): general direction in
which a time series is moving over a long interval of time
Cyclic movements or cycle variations: long term oscillations about a
trend line or curve
e.g., business cycles, may or may not be periodic
Seasonal movements or seasonal variations
i.e, almost identical patterns that a time series appears to follow
during corresponding months of successive years.
Irregular or random movements
Time series analysis: decomposition of a time series into these four
basic movements
Additive Model: TS = T + C + S + I
Multiplicative Model: TS = T × C × S × I
27. Statistics: Unlocking the Power of Data Lock5
Estimation of Trend Curve
The freehand method
Fit the curve by looking at the graph
Costly and barely reliable for large-scale data mining
The least-square method
Find the curve minimizing the sum of the squares of the deviation of points on
the curve from the corresponding data points
The moving-average method
28. Statistics: Unlocking the Power of Data Lock5 28
Moving Average
Moving average of order n
Smoothes the data
Eliminates cyclic, seasonal and irregular movements
Loses the data at the beginning or end of a series
Sensitive to outliers (can be reduced by weighted moving
average)
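The effect of a moving average of order n, and of its weighted variant, can be sketched as follows. The function names, the sample series and the weights (1, 4, 1) are illustrative assumptions.

```python
def moving_average(values, n):
    """Simple moving average of order n; yields len(values) - n + 1 points."""
    return [sum(values[i:i + n]) / n for i in range(len(values) - n + 1)]


def weighted_moving_average(values, weights):
    """Weighted moving average; the window length equals len(weights)."""
    n = len(weights)
    total = sum(weights)
    return [
        sum(w * v for w, v in zip(weights, values[i:i + n])) / total
        for i in range(len(values) - n + 1)
    ]


series = [3, 7, 2, 0, 4, 5, 9, 7, 2]
ma3 = moving_average(series, 3)                    # [4.0, 3.0, 2.0, 3.0, 6.0, 7.0, 6.0]
wma3 = weighted_moving_average(series, [1, 4, 1])  # [5.5, 2.5, 1.0, 3.5, 5.5, 8.0, 6.5]
```

Note that both outputs are shorter than the input by n - 1 points, which is the loss of data at the ends of the series mentioned above.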
29. Statistics: Unlocking the Power of Data Lock5 29
Trend Discovery in Time-Series (1):
Estimation of Seasonal Variations
Seasonal index
Set of numbers showing the relative values of a variable during the
months of the year
E.g., if the sales during October, November, and December are 80%,
120%, and 140% of the average monthly sales for the whole year,
respectively, then 80, 120, and 140 are seasonal index numbers for
these months
Deseasonalized data
Data adjusted for seasonal variations for better trend and cyclic
analysis
Divide the original monthly data by the seasonal index numbers for
the corresponding months
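A small sketch of computing seasonal index numbers and deseasonalizing, under the simplifying assumption that a single year of monthly data is available (in practice the index is usually averaged over several years). The function names and the sales figures are made up for illustration; they are chosen so that October to December reproduce the 80, 120, 140 example.

```python
def seasonal_index(monthly):
    """Each month's value as a percentage of the average monthly value."""
    mean = sum(monthly) / len(monthly)
    return [100.0 * v / mean for v in monthly]


def deseasonalize(values, index):
    """Divide each value by its seasonal index (expressed as a percentage)."""
    return [v * 100.0 / idx for v, idx in zip(values, index)]


# Hypothetical monthly sales, January to December; the average is 100.
sales = [90, 90, 100, 100, 100, 100, 100, 90, 90, 80, 120, 140]
index = seasonal_index(sales)
adjusted = deseasonalize(sales, index)
```

Here the deseasonalized series is flat because the toy data contain no trend or irregular component; on real data what remains after the adjustment is the trend, cyclic and irregular movement.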
30. Statistics: Unlocking the Power of Data Lock5
February 2, 2023 Data Mining: Concepts and Techniques 30
Seasonal Index
[Figure: bar chart of the seasonal index (y-axis, 0 to 160) by month (x-axis, 1 to 12)]
Raw data from http://www.bbk.ac.uk/manop/man/docs/QII_2_2003%20Time%20series.pdf
31. Statistics: Unlocking the Power of Data Lock5
Trend Discovery in Time-Series (2)
Estimation of cyclic variations
If (approximate) periodicity of cycles occurs, cyclic index can be constructed in
much the same manner as seasonal indexes
Estimation of irregular variations
By adjusting the data for trend, seasonal and cyclic variations
With the systematic analysis of the trend, cyclic, seasonal, and irregular
components, it is possible to make long- or short-term predictions with
reasonable quality
32. Statistics: Unlocking the Power of Data Lock5 32
Similarity Search in Time-Series Analysis
Normal database query finds exact match
Similarity search finds data sequences that differ only
slightly from the given query sequence
Two categories of similarity queries
Whole matching: find a sequence that is similar to the query
sequence
Subsequence matching: find all pairs of similar sequences
Typical Applications
Financial market
Market basket data analysis
Scientific databases
Medical diagnosis
33. Statistics: Unlocking the Power of Data Lock5 33
Data Transformation
Many techniques for signal analysis require the data to be
in the frequency domain
Usually data-independent transformations are used
The transformation matrix is determined a priori
discrete Fourier transform (DFT)
discrete wavelet transform (DWT)
The distance between two signals in the time domain is
the same as their Euclidean distance in the frequency
domain
34. Statistics: Unlocking the Power of Data Lock5 34
Discrete Fourier Transform
DFT does a good job of concentrating energy in the first
few coefficients
If we keep only first a few coefficients in DFT, we can
compute the lower bounds of the actual distance
Feature extraction: keep the first few coefficients (F-index)
as representative of the sequence
35. Statistics: Unlocking the Power of Data Lock5 35
DFT (continued)
Parseval’s Theorem
The Euclidean distance between two signals in the time
domain is the same as their distance in the frequency
domain
Keeping only the first few (say, 3) coefficients underestimates the
distance, so there will be no false dismissals!
$\sum_{t=0}^{n-1} |x_t|^2 = \sum_{f=0}^{n-1} |X_f|^2$
$\sum_{f=0}^{3} |F(S)[f] - F(Q)[f]|^2 \le \sum_{t=0}^{n-1} |S[t] - Q[t]|^2$
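The lower-bounding property can be checked numerically. The sketch below assumes an orthonormal DFT (scaled by 1/sqrt(n)) so that Parseval's identity holds without extra factors; the two sequences and the helper names are illustrative assumptions.

```python
import cmath
import math


def dft(x):
    """Orthonormal discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)) / math.sqrt(n)
        for f in range(n)
    ]


def euclidean(a, b):
    """Euclidean distance; abs() handles both real and complex entries."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))


s = [1.0, 2.0, 4.0, 3.0, 5.0, 6.0, 4.0, 2.0]
q = [1.5, 2.5, 3.0, 3.5, 4.0, 6.5, 5.0, 1.0]

time_dist = euclidean(s, q)                        # distance in the time domain
freq_dist = euclidean(dft(s), dft(q))              # same value, by Parseval's theorem
reduced_dist = euclidean(dft(s)[:3], dft(q)[:3])   # first 3 coefficients only
```

Because `reduced_dist` can never exceed `time_dist`, an index built on the first few coefficients may return false matches (removed in post-processing) but never misses a true one.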
36. Statistics: Unlocking the Power of Data Lock5 36
Multidimensional Indexing in Time-Series
Multidimensional index construction
Constructed for efficient accessing using the first few Fourier coefficients
Similarity search
Use the index to retrieve the sequences that are at most a certain small distance
away from the query sequence
Perform post-processing by computing the actual distance between sequences in
the time domain and discard any false matches
37. Statistics: Unlocking the Power of Data Lock5
Subsequence Matching
Break each sequence into a set of pieces of window with length w
Extract the features of the subsequence inside the window
Map each sequence to a “trail” in the feature space
Divide the trail of each sequence into “subtrails” and represent each of
them with minimum bounding rectangle
Use a multi-piece assembly algorithm to search for longer sequence
matches
39. Statistics: Unlocking the Power of Data Lock5
Enhanced Similarity Search Methods
Allow for gaps within a sequence or differences in offsets or amplitudes
Normalize sequences with amplitude scaling and offset translation
Two subsequences are considered similar if one lies within an envelope of
width ε around the other, ignoring outliers
Two sequences are said to be similar if they have enough non-overlapping
time-ordered pairs of similar subsequences
Parameters specified by a user or expert: sliding window size, width of an
envelope for similarity, maximum gap, and matching fraction
40. Statistics: Unlocking the Power of Data Lock5 40
Steps for Performing a Similarity Search
Atomic matching
Find all pairs of gap-free windows of a small length that are
similar
Window stitching
Stitch similar windows to form pairs of large similar
subsequences allowing gaps between atomic matches
Subsequence Ordering
Linearly order the subsequence matches to determine whether
enough similar pieces exist
41. Statistics: Unlocking the Power of Data Lock5 41
Similar Time Series Analysis
[Figure: price series of VanEck International Fund and Fidelity Selective Precious Metal and Mineral Fund]
Two similar mutual funds in different fund groups
42. Statistics: Unlocking the Power of Data Lock5 42
Sequence Distance
A function that measures the differentness of two
sequences (of possibly unequal length)
Example: Euclidean distance between time series Q and C
$D(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$
43. Statistics: Unlocking the Power of Data Lock5 43
Motif: Basic Concepts
What is a motif? A previously unknown, frequently
occurring sequential pattern
Match: Given subsequences Q, C ⊆ T and a range R > 0,
C is a match for Q iff D(Q, C) ≤ R
Trivial vs. non-trivial match: let C = T[p..*], Q = T[q..*] and C match Q.
If p = q or ∄ non-match N = T[s..*] such that s is between p and q,
then the match is trivial; otherwise it is non-trivial.
(i.e., C and Q must be separated by a non-match)
1-Motif: the subsequence with the most non-trivial matches
(least variance decides ties)
k-Motif: Ck such that D(Ck, Ci) > 2R ∀i ∈ [1,k)
44. Statistics: Unlocking the Power of Data Lock5 44
SAX: Symbolic Aggregate approXimation
Dim. Reduction/Compression
“Symbolic Aggregate approXimation”
SAX : ℝ → ∑
SAX : ↦ ccbaabbbabcbcb
Essentially an alphabet over the Piecewise Aggregate
Approximation (PAA) rank
Faster, simpler, more compression, yet on par with DFT,
DWT and other dim. reductions
46. Statistics: Unlocking the Power of Data Lock5 46
SAX Algorithm
Parameters: alphabet size, word (segment) length (or output
rate)
1.Select probability distribution for TS
2.z-Normalize TS
3.PAA: Within each time interval, calculate aggregated value
(mean) of the segment
4.Partition TS range by equal-area partitioning the PDF into
n partitions (eq. freq. binning)
5.Label each segment with a_rank ∈ ∑ for the aggregate’s
corresponding partition rank
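The steps above can be sketched as follows, assuming a standard normal distribution for the z-normalized series and a series length divisible by the word length. The function name `sax` and the toy input are assumptions made for illustration.

```python
from statistics import NormalDist, mean, pstdev


def sax(series, word_length, alphabet="abcd"):
    """Convert a numeric series to a SAX word of `word_length` symbols."""
    # Step 2: z-normalize the series.
    mu, sigma = mean(series), pstdev(series)
    z = [(v - mu) / sigma for v in series]
    # Step 3: PAA, the mean of each equal-length segment.
    seg = len(z) // word_length
    paa = [mean(z[i * seg:(i + 1) * seg]) for i in range(word_length)]
    # Step 4: breakpoints that cut the N(0, 1) density into equal-area regions.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(k / a) for k in range(1, a)]
    # Step 5: label each segment by the rank of the region its mean falls into.
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in paa)


word = sax([1, 2, 3, 4, 5, 6, 7, 8], word_length=4, alphabet="abcd")
```

A steadily rising series maps to the word `abcd`, one symbol per segment; the alphabet size and word length control the compression rate.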
47. Statistics: Unlocking the Power of Data Lock5 47
Finding Motifs in a Time Series
EMMA Algorithm: Finds 1-(k-)motif of fixed length n
SAX Compression (Dim. Reduction)
Possible to store D(i,j) ∀(i,j) ∈ ∑∑
Allows use of various distance measures (Minkowski, Dynamic Time
Warping)
Multiple Tiers
Tier 1: Uses a sliding window to hash length-w SAX subsequences
(a^w addresses, total size O(m)).
The bucket B with the most collisions and the buckets with
MINDIST(B) < R form the neighborhood of B.
Tier 2: The neighborhood is pruned using the more precise ADM
algorithm. The N_i with the most matches is the 1-motif. Early stop if |ADM
matches| > max_{k>i}(|neighborhood_k|)
48. Statistics: Unlocking the Power of Data Lock5 48
Hashing
[Figure: a window of length w slides over the length-n series; each SAX word (e.g., c e c a b b c b a c) is converted to integer symbols (2 4 2 0 1 1 2 1 0 2) and hashed to a bucket address]
49. Statistics: Unlocking the Power of Data Lock5
Classification in Time Series
Application: finance
1-Nearest Neighbor
Pros: accurate, robust, simple
Cons: time and space complexity (lazy learning); results are not
interpretable
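A minimal 1-nearest-neighbor classifier for equal-length series, using Euclidean distance. The labels and series are hypothetical; the sketch also shows why the method is "lazy": all training series are kept and scanned at query time.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def classify_1nn(train, query):
    """train is a list of (series, label) pairs; return the label of the closest series."""
    _, label = min(train, key=lambda item: euclidean(item[0], query))
    return label


train = [
    ([1, 2, 3, 4, 5], "uptrend"),
    ([5, 4, 3, 2, 1], "downtrend"),
    ([3, 3, 3, 3, 3], "flat"),
]
prediction = classify_1nn(train, [2, 2, 3, 5, 5])
```

Each query costs one distance computation per stored series, which is the time and space drawback listed above.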
50. Statistics: Unlocking the Power of Data Lock5
Financial Data Applications
Fraud Detection - Anomaly Analysis
51. Statistics: Unlocking the Power of Data Lock5
What are Anomalies?
Anomaly is a pattern in the data that does not conform to
the expected behavior
Also referred to as outliers, exceptions, peculiarities,
surprise, etc.
Anomalies translate to significant (often critical) real life
entities
Cyber intrusions
Credit card fraud
52. Statistics: Unlocking the Power of Data Lock5
Real World Anomalies
Credit Card Fraud
An abnormally high purchase made on a
credit card
Cyber Intrusions
A web server involved in ftp traffic
53. Statistics: Unlocking the Power of Data Lock5
Simple Example
N1 and N2 are regions of
normal behavior
Points o1 and o2 are
anomalies
Points in region O3 are
anomalies
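[Figure: scatter plot in the X-Y plane showing normal regions N1 and N2, isolated points o1 and o2, and the small region O3]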
54. Statistics: Unlocking the Power of Data Lock5
Related problems
Rare Class Mining
Chance discovery
Novelty Detection
Exception Mining
Noise Removal
Black Swan*
55. Statistics: Unlocking the Power of Data Lock5
Key Challenges
Defining a representative normal region is
challenging
The boundary between normal and outlying
behavior is often not precise
The exact notion of an outlier is different for
different application domains
Availability of labeled data for training/validation
Malicious adversaries
Data might contain noise
Normal behavior keeps evolving
56. Statistics: Unlocking the Power of Data Lock5
Data Labels
Supervised Anomaly Detection
Labels available for both normal data and anomalies
Similar to rare class mining
Semi-supervised Anomaly Detection
Labels available only for normal data
Unsupervised Anomaly Detection
No labels assumed
Based on the assumption that anomalies are very rare compared to normal data
57. Statistics: Unlocking the Power of Data Lock5
Applications of Anomaly Detection
Insurance / Credit card fraud detection
Anti-Money Laundering (AML)
Fraud
Identity Theft and Fake Account Registration
Risk Modeling
Account Takeover
Promotion Credit Abuse
Customer Behavior Analytics
Cyber Security
58. Fraud Detection
Fraud detection refers to detection of criminal activities
occurring in commercial organizations
Malicious users might be the actual customers of the organization
or might be posing as a customer (also known as identity theft).
Types of fraud
Credit card fraud
Insurance claim fraud
Mobile / cell phone fraud
Insider trading
Challenges
Fast and accurate real-time detection
Misclassification cost is very high
59. Statistics: Unlocking the Power of Data Lock5
Classification Based Techniques
Main idea: build a classification model for normal (and anomalous (rare))
events based on labeled training data, and use it to classify each new
unseen event
Classification models must be able to handle skewed (imbalanced) class
distributions
Categories:
Supervised classification techniques
Require knowledge of both normal and anomaly class
Build classifier to distinguish between normal and known anomalies
Semi-supervised classification techniques
Require knowledge of normal class only!
Use modified classification model to learn the normal behavior and then detect any
deviations from normal behavior as anomalous
60. Statistics: Unlocking the Power of Data Lock5
Classification Based Techniques
Advantages:
Supervised classification techniques
Models that can be easily understood
High accuracy in detecting many kinds of known anomalies
Semi-supervised classification techniques
Models that can be easily understood
Normal behavior can be accurately learned
Drawbacks:
Supervised classification techniques
Require labels from both normal and anomaly class
Cannot detect unknown and emerging anomalies
Semi-supervised classification techniques
Require labels from normal class
Possible high false alarm rate - previously unseen (yet legitimate) data records
may be recognized as anomalies
61. Statistics: Unlocking the Power of Data Lock5
Supervised Classification Techniques
Manipulating data records (oversampling /
undersampling / generating artificial examples)
Rule based techniques
Model based techniques
Neural network based approaches
Support Vector machines (SVM) based approaches
Bayesian networks based approaches
Cost-sensitive classification techniques
Ensemble based algorithms (SMOTEBoost,
RareBoost)
62. Statistics: Unlocking the Power of Data Lock5
Semi-supervised Classification Techniques
Use modified classification model to learn the
normal behavior and then detect any deviations
from normal behavior as anomalous
Recent approaches:
Neural network based approaches
Support Vector machines (SVM) based approaches
Markov model based approaches
Rule-based approaches
63. Statistics: Unlocking the Power of Data Lock5
Nearest Neighbor Based Techniques
Key assumption: normal points have close neighbors
while anomalies are located far from other points
General two-step approach
1. Compute neighborhood for each data record
2. Analyze the neighborhood to determine whether data
record is anomaly or not
Categories:
Distance based methods
Anomalies are data points most distant from other points
Density based methods
Anomalies are data points in low density regions
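One common distance-based variant scores each record by the distance to its k-th nearest neighbor. The sketch below is illustrative only: the function name `knn_scores`, the choice k = 2 and the points are assumptions.

```python
import math


def knn_scores(points, k):
    """Anomaly score of each point = distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Step 1: compute the neighborhood (distances to all other points).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        # Step 2: analyze it; a large k-th neighbor distance suggests an anomaly.
        scores.append(dists[k - 1])
    return scores


points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (8.0, 8.0)]
scores = knn_scores(points, k=2)
most_anomalous = points[scores.index(max(scores))]
```

This brute-force version compares every pair of points, so its cost grows quadratically with the number of records.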
64. Statistics: Unlocking the Power of Data Lock5
Clustering Based Techniques
Key assumption: normal data records belong to large and
dense clusters, while anomalies do not belong to any of
the clusters or form very small clusters
Categorization according to labels
Semi-supervised – cluster normal data to create modes of normal
behavior. If a new instance does not belong to any of the clusters or is
not close to any cluster, it is an anomaly
Unsupervised – post-processing is needed after the clustering step to
determine the size of the clusters and the distance from the clusters
required for a point to be an anomaly
Anomalies detected using clustering based methods can be:
Data records that do not fit into any cluster (residuals from clustering)
Small clusters
Low density clusters or local anomalies (far from other points within the
same cluster)
65. Statistics: Unlocking the Power of Data Lock5
Clustering Based Techniques
Advantages:
No need to be supervised
Easily adaptable to on-line / incremental mode suitable for
anomaly detection from temporal data
Drawbacks
Computationally expensive
Using indexing structures (k-d tree, R* tree) may alleviate this
problem
If normal points do not create any clusters the techniques
may fail
In high dimensional spaces, data is sparse and distances
between any two data records may become quite similar.
Clustering algorithms may not give any meaningful clusters
66. Statistics: Unlocking the Power of Data Lock5
Statistics Based Techniques
Data points are modeled using a stochastic distribution;
points are determined to be outliers depending on their
relationship with this model
Advantage
Utilize existing statistical modeling techniques to model various type
of distributions
Challenges
With high dimensions, difficult to estimate distributions
Parametric assumptions often do not hold for real data sets
67. Statistics: Unlocking the Power of Data Lock5
Types of Statistical Techniques
Parametric Techniques
Assume that the normal (and possibly anomalous) data is generated
from an underlying parametric distribution
Learn the parameters from the normal sample
Determine the likelihood of a test instance to be generated from this
distribution to detect anomalies
Non-parametric Techniques
Do not assume any knowledge of parameters
Use non-parametric techniques to learn a distribution – e.g., Parzen
window estimation
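A sketch of the parametric approach, assuming the normal data follow a single Gaussian and using a three-standard-deviation cutoff. The distribution choice, the threshold, the function names and the transaction amounts are all illustrative assumptions.

```python
from statistics import mean, stdev


def fit_gaussian(normal_sample):
    """Learn the parameters (mean, standard deviation) from normal data only."""
    return mean(normal_sample), stdev(normal_sample)


def is_anomaly(value, mu, sigma, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    return abs(value - mu) / sigma > threshold


# Hypothetical transaction amounts known to be normal.
normal_amounts = [42.0, 38.5, 40.0, 45.5, 39.0, 41.5, 43.0, 40.5]
mu, sigma = fit_gaussian(normal_amounts)
flags = [is_anomaly(v, mu, sigma) for v in (41.0, 250.0)]
```

The typical amount is accepted and the very large one is flagged; if real data are not close to Gaussian, the parametric assumption fails and a non-parametric density estimate may be more appropriate.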
68. Statistics: Unlocking the Power of Data Lock5
Information Theory Based Techniques
Compute information content in data using information
theoretic measures, e.g., entropy, relative entropy, etc.
Key idea: Outliers significantly alter the information content
in a dataset
Approach: Detect data instances that significantly alter the
information content
Require an information theoretic measure
Advantage
Operate in an unsupervised mode
Challenges
Require an information theoretic measure sensitive enough to detect
irregularity induced by very few outliers
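As a rough illustration of the information-theoretic idea, the sketch below uses Shannon entropy over a categorical attribute and scores each record by how much the dataset entropy falls when that record is removed. The measure, the function names and the merchant categories are assumptions chosen for the example.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_drop(values):
    """For each record, the decrease in dataset entropy when it is removed."""
    base = entropy(values)
    return [base - entropy(values[:i] + values[i + 1:]) for i in range(len(values))]


# Hypothetical merchant categories for one cardholder's transactions.
codes = ["grocery"] * 6 + ["fuel"] * 5 + ["casino"]
drops = entropy_drop(codes)
suspect = codes[drops.index(max(drops))]
```

Removing the single rare record simplifies the data the most, so it receives the highest score; recomputing the entropy for every record is expensive on large datasets, which is part of why the sensitivity of the measure matters.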
69. Statistics: Unlocking the Power of Data Lock5
Visualization Based Techniques
Use visualization tools to observe the data
Provide alternate views of data for manual
inspection
Anomalies are detected visually
Advantages
Keeps a human in the loop
Disadvantages
Works well only for low-dimensional data
Can provide only aggregated or partial views for high
dimension data
70. Statistics: Unlocking the Power of Data Lock5
Visual Data Mining*
Detecting telecommunication fraud
Display telephone call
patterns as a graph
Use colors to identify
fraudulent telephone
calls (anomalies)
71. Statistics: Unlocking the Power of Data Lock5
Contextual Anomaly Detection
Detect context anomalies
General Approach
Identify a context around a data instance (using a set of
contextual attributes)
Determine if the data instance is anomalous w.r.t. the context
(using a set of behavioral attributes)
Assumption
All normal instances within a context will be similar (in terms of
behavioral attributes), while the anomalies will be different
72. Statistics: Unlocking the Power of Data Lock5
Contextual Attributes
Contextual attributes define a neighborhood
(context) for each instance
For example:
Spatial Context
Latitude, Longitude
Graph Context
Edges, Weights
Sequential Context
Position, Time
Profile Context
User demographics
73. Statistics: Unlocking the Power of Data Lock5
Sequential Anomaly Detection
Detect anomalous sequences in a database of
sequences, or
Detect anomalous subsequence within a sequence
Data is presented as a set of symbolic sequences
System call intrusion detection
Proteomics
Climate data
74. Statistics: Unlocking the Power of Data Lock5
Motivation for On-line Anomaly Detection
Data in many rare events applications arrives continuously
at an enormous pace
There is a significant challenge to analyze such data
Examples of such rare events applications:
Video analysis
Network traffic monitoring
Credit card fraudulent transactions
75. Statistics: Unlocking the Power of Data Lock5
Sentiment Analysis for Finance
Sentiment analysis is an emerging area where structured and
unstructured data is analyzed to generate useful insights leading to
improved performances.
Information obtained from multiple sources including news wires, macroeconomic
announcements, social media, microblogs/Twitter, online
(search) information such as Google Trends and Wikipedia influences both
business intelligence and performance evaluation.
This sentiment data can help investors and finance professionals to
exploit the market and manage their risk exposure.
Stock market prediction
New product review
Stock Trading
Customer Brand Building