An Introduction to
Anomaly Detection
Ken Graham
What we’ll cover
•What is Anomaly Detection?
•What’s an anomaly?
•Detecting Anomalies
•Methods and Applications
What is Anomaly Detection?
• Trying to find patterns in data that are different from the
expected.
• Some applications:
◦credit card fraud
◦insurance fraud
◦image processing
◦intrusion detection (cybersecurity)
◦text analysis
◦sensor networks
◦insider threats
◦industrial damage
Detecting Anomalies
So, how would we detect some of these? Let’s take a
naive approach.
1. Define a “normal” region. 
2. Observations not in the “normal” region are
anomalies. 
Will this work? 
• Boundary hard to define
• Definitions change over time
• Definitions are domain-dependent
• Labeled training data is hard to find
• Training data is often heavily imbalanced
Types of Data
• Collection of data instances
• a data instance has a set of attributes
• Attributes can be of different types
• binary
• categorical
• continuous
• The attributes help determine the detection
method.
• The relationship between data instances is
important.
• Most existing anomaly detection techniques don’t
assume any particular relationship between the
data instances. We have to identify relationships.
Types of input data
• Sequential
• time-series, sequences of symbols
• Spatial
• each data instance is related to its neighbors
• images, vehicular traffic
• Graph
• data instances are nodes in a graph or network
Three Types of Anomalies
• 😃 There are only three. 
• 😔 No, that doesn’t make it any easier to detect
them.
• Point anomaly
• Contextual anomaly
• Collective anomaly
Point Anomaly
• Generally a single data instance. 
• Anomalous compared to the entirety of the data
• Most research focuses on point anomalies
• Can occur in any dataset
Contextual Anomaly
• Anomalous in relation to a specific context
• Context comes from how data is structured
• Context has to be specified as a part of the problem
formulation
• Each data instance can be defined using two sets of
attributes:
• contextual: determines the context (e.g. lat/long or time)
• behavioral: non-contextual characteristics of an instance
• Anomalous behavior is determined by the
behavioral attributes within a specific context
• A data instance might be a contextual anomaly in a
given context, but a data instance with identical
behavioral attributes could be considered normal in
a different context. 
• Contextual anomalies are generally found in time-
series data. Example:
• Avg monthly temp. of an area over last few years.
• 35 degrees F in winter might be normal
• 35 degrees F in summer in same place is
anomalous
• Another example: Credit card fraud
• Contextual attribute: time of purchase. 
• $100 average weekly shopping bill, except during
the Christmas week, when it reaches $1000. 
• A new purchase of $1000 in July would be
considered a contextual anomaly, since it’s
unusual for July. 
• The same amount spent during Christmas week
will be considered normal.
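The credit card example can be sketched as a per-context profile. This is a minimal illustration, not a production fraud model: the spending history, the context labels ("july", "christmas_week") and the function names are all made up for the example, and the score is just the number of standard deviations from the context mean.

```python
from collections import defaultdict
from statistics import mean, pstdev

def build_profile(history):
    """history: iterable of (context, amount). Returns {context: (mean, std)}."""
    by_context = defaultdict(list)
    for context, amount in history:
        by_context[context].append(amount)
    return {c: (mean(v), pstdev(v)) for c, v in by_context.items()}

def contextual_score(profile, context, amount):
    """Distance of amount from its context mean, in standard deviations."""
    mu, sigma = profile[context]
    if sigma == 0:
        return 0.0 if amount == mu else float("inf")
    return abs(amount - mu) / sigma

# Hypothetical weekly spending, labeled by context (assumption for illustration).
history = [
    ("july", 100), ("july", 110), ("july", 95), ("july", 105),
    ("christmas_week", 950), ("christmas_week", 1000), ("christmas_week", 1050),
]
profile = build_profile(history)

# The same $1000 purchase, scored in two different contexts.
july_score = contextual_score(profile, "july", 1000)
xmas_score = contextual_score(profile, "christmas_week", 1000)
print(july_score, xmas_score)
```

The behavioral attribute (amount) is identical in both calls; only the contextual attribute changes, and with it the score.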
Collective Anomaly
• A group of data instances is anomalous as a whole
• The individual instances need not be anomalies by themselves
• Again, the relationship between the data matters
• A point or collective anomaly problem, plus context, becomes a
contextual problem
Three Types of
Anomaly Detection Methods
• Supervised
• Use labeled training data to build a predictive model
• Imbalanced data (many normal, few anomalies)
• Semi-Supervised
• Only need normal data
• Model learns how to classify normal data
• Unsupervised (no labeled data)
Applications
Credit Card Fraud
Data used
• user ID
• amount spent
• time between consecutive card usage
Credit card companies have complete, labeled data and
user profiles
Kinds of anomalies 
• point anomalies in transaction records
◦high payments
◦items never before purchased by the user
◦high rate of purchase
• contextual anomalies
◦User defines the context
▪ Each credit card user is profiled based on card usage
history. 
▪ Each new transaction compared to user profile,
flagged if it doesn’t match
◦Location defines the context
▪ Detects anomalies among transactions at a specific
geographic location. 
Cellphone Fraud
Data used 
• Call data records (CDRs)
• CDR = vector of features
◦continuous (e.g., CALL-DURATION)
◦discrete (e.g., CALLING-CITY). 
Kinds of anomalies
• point anomalies from aggregated CDR data
◦aggregated by time, user, or area
◦high volume of calls
◦calls made to unlikely destinations
Insider Trading
Data used
• Option trading data
• Stock trading data
• News
• Data is time-series or otherwise temporally sequenced.
Medical
• Patient records
◦Electronic Health Records (EHRs)
▪ demographics, medical history, medication and allergies,
immunization status, laboratory test results, radiology images,
vital signs, personal statistics like age and weight, and
billing information
◦Electrocardiograms (ECG) and Electroencephalograms
(EEG)
• Temporal and/or spatial data
Types of anomalies
• point anomalies
◦e.g., abnormal patient condition, instrumentation errors,
recording errors
• contextual
◦Disease outbreaks can be contextual anomalies
(e.g. geo-temporal pattern of viral infections)
• collective
• False negatives can cost $$$ and lives
• A colleague (David Gilmore) said: 
• "Precision saves money, recall saves lives."
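For reference, precision and recall can be computed from labels and predictions as follows. This is a small sketch; the label vectors are invented for illustration (1 = anomaly, 0 = normal).

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (anomaly) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # anomalies caught
    fp = sum(1 for t, p in pairs if not t and p)    # false alarms
    fn = sum(1 for t, p in pairs if t and not p)    # anomalies missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical ground truth and detector output.
y_true = [1, 0, 0, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Low precision means many false alarms (wasted money); low recall means missed anomalies (false negatives).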
Methods
Classication
• Train a model from labeled data (supervised)
• Use the model to classify other data
• Many different ways to do this
◦SVMs, PGMs, Rules
◦Neural nets have shown much promise
▪ LSTMs learn features across a sequence
▪ Autoencoders reconstruct the data; the reconstruction error tells
you if the data is anomalous
Recurrent Neural Nets and
LSTMs
Now we’ll look at a method or two for time-series data.
• Method needs to learn patterns present in the sequence
• Sequences can have patterns of unknown length
• Recurrent neural networks (RNNs)[1][2] let you address
sequences of data
• Detect deviations from normalcy
• Steps
◦Train the NN to predict several time steps into the future 
◦Each point in the sequence has several corresponding
predicted values made at different points in the past,
resulting in multiple error values. 
◦Compute error distribution
• More generally, to detect anomalies in a time series
◦Anomalous if prediction error is larger than expected
◦Can pick an error threshold, e.g. 2 std. dev. from the mean
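The thresholding step can be sketched without a neural net. In the sketch below a rolling median stands in for the trained RNN (an assumption made to keep the example self-contained); any predictor that maps a history to a next-value forecast could be passed in instead. The series is invented.

```python
from statistics import mean, median, pstdev

def median_of_last_three(history):
    """Stand-in predictor (NOT an RNN): forecast the median of the last 3 values."""
    return median(history[-3:])

def flag_by_prediction_error(series, predict, warmup=3, k=2.0):
    """Return indices whose prediction error exceeds mean + k * std of all errors."""
    steps = range(warmup, len(series))
    errors = [abs(series[t] - predict(series[:t])) for t in steps]
    mu, sigma = mean(errors), pstdev(errors)
    return [t for t, e in zip(steps, errors) if e > mu + k * sigma]

# Hypothetical series with one spike at index 7.
series = [10, 11, 10, 12, 11, 10, 11, 40, 11, 10, 12, 11]
flagged = flag_by_prediction_error(series, median_of_last_three)
print(flagged)
```

Note that the error distribution here is estimated from the same series being scored; in practice it would usually come from held-out normal data.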
Autoencoders for Anomaly Detection
• Train the autoencoder.
• If the data is sequential, you can incorporate RNNs
or LSTMs.
• Use the model to reconstruct the input.
• If the reconstruction error is above some threshold,
label it as an anomaly
Nearest-Neighbor Methods
Assumption 
• Normal data are close together, while anomalies are far away
Two Methods
1. Anomaly score is distance to kth nearest neighbor.
2. Anomaly score is the density of the neighborhood of each
point
• Distance metric affects computational complexity
• Easy to adapt to a different problem domain: just define the
distance metric
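Method 1 can be sketched in a few lines. This is a brute-force illustration using Euclidean distance (an assumed metric; swap in whatever fits the domain), and the points are invented.

```python
import math

def knn_scores(points, k=2):
    """Anomaly score of each point = distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Four points in a tight cluster and one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_scores(points, k=2)
print(scores)
```

The brute-force loop compares every pair of points, so it is only practical for small datasets.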
Statistical Methods
• Assumption
• Normal data lies in high probability regions,
anomalies in low probability regions
• Parametric and non-parametric methods
Parametric
• Assumes normal data is distributed according to a parametric
distribution
• Anomaly score is inverse of the PDF 
• Or, use a hypothesis test. Anomaly score can be test statistic
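A minimal sketch of the inverse-PDF idea, assuming a one-dimensional Gaussian fit by maximum likelihood. The negative log-density is used as the score because it increases exactly when the inverse of the PDF does and avoids dividing by very small numbers. The data values are invented.

```python
import math
from statistics import mean, pstdev

def fit_gaussian(data):
    """MLE for a 1-D Gaussian: sample mean and population standard deviation."""
    return mean(data), pstdev(data)

def neg_log_pdf(x, mu, sigma):
    """Negative log-density of x under N(mu, sigma^2); higher = more anomalous."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (x - mu) ** 2 / (2 * sigma ** 2)

# Hypothetical "normal" readings used to fit the model.
normal_data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0]
mu, sigma = fit_gaussian(normal_data)

score_typical = neg_log_pdf(10.05, mu, sigma)
score_odd = neg_log_pdf(13.0, mu, sigma)
print(score_typical, score_odd)
```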
Examples: 
• Gaussian models => maximum likelihood estimation (MLE),
Grubbs' test and variants
• Regression models => ARIMA, ARMA
• mixtures of models
◦Assume each data point has prob. p of being an anomaly
◦N = PDF of normal data
◦A = PDF of anomalies (assumed to be uniform)
◦D = PDF of all the data = pA + (1-p)N
◦Start with all points in N
◦Anomaly score comes from how much the distributions
change if you move point to A.
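The mixture steps above can be sketched as follows. Several choices here are assumptions made for illustration: N is a 1-D Gaussian refit by maximum likelihood, A is uniform over the observed data range, p = 0.05, and the score is the change in log-likelihood of D when a single point is moved from N to A.

```python
import math
from statistics import mean, pstdev

def gaussian_log_pdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def log_likelihood(normal, anomalies, p, lo, hi):
    """Log-likelihood of the data under D = p*A + (1-p)*N for a given split."""
    mu, sigma = mean(normal), pstdev(normal)
    ll = len(normal) * math.log(1 - p)
    ll += sum(gaussian_log_pdf(x, mu, sigma) for x in normal)
    # A is assumed uniform on [lo, hi], so its density is 1 / (hi - lo).
    ll += len(anomalies) * (math.log(p) - math.log(hi - lo))
    return ll

def mixture_scores(data, p=0.05):
    """Score each point by the log-likelihood gain from moving it to A."""
    lo, hi = min(data), max(data)
    base = log_likelihood(data, [], p, lo, hi)   # start with all points in N
    scores = []
    for i, x in enumerate(data):
        rest = data[:i] + data[i + 1:]
        scores.append(log_likelihood(rest, [x], p, lo, hi) - base)
    return scores

# Hypothetical readings; the last one is far from the others.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 13.0]
scores = mixture_scores(data)
print(scores)
```

A positive score means the data is better explained with that point treated as an anomaly; a threshold on the score decides the label.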
Non-parametric
• Histogram models
◦Does test instance fit into an existing bin?
◦Or, determine the score from the bin in which it lands
• Kernel methods estimate the data PDF and are similar to
parametric methods 
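A histogram model can be sketched like this. The bin width, the training values and the choice of score (one minus the relative frequency of the bin) are assumptions for illustration.

```python
from collections import Counter

def fit_histogram(train, bin_width):
    """Count training values per fixed-width bin."""
    return Counter(int(x // bin_width) for x in train)

def histogram_score(x, counts, bin_width, n_train):
    """1 - relative frequency of x's bin; 1.0 means the bin was never seen."""
    freq = counts.get(int(x // bin_width), 0) / n_train
    return 1.0 - freq

# Hypothetical training data.
train = [1.0, 1.2, 1.4, 2.1, 2.3, 2.2, 2.8, 3.1]
counts = fit_histogram(train, bin_width=1.0)

common = histogram_score(2.5, counts, 1.0, len(train))   # lands in a full bin
unseen = histogram_score(9.0, counts, 1.0, len(train))   # lands in an empty bin
print(common, unseen)
```

The result is sensitive to bin width: bins that are too narrow leave normal values in empty bins, bins that are too wide hide anomalies.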
Spectral Methods
Assumption
• "Data can be embedded into a lower dimensional subspace
in which normal instances and anomalies appear significantly
different." - Anomaly Detection: A Survey
Main idea: 
Find a subspace where the anomalies are easy to see and
project data onto it.
Methods 
• Unsupervised or semi-supervised
• PCA
◦Project data along low variance principal components.
Anomaly projections will be high 
◦For graphs, run PCA on the graph’s adjacency matrix at different
points in time; differences in the principal components determine
anomaly status
• Errors in the Compact Matrix Decomposition (CMD) of a graph’s
adjacency matrix can indicate an anomalous graph
• PCA can be expensive
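The low-variance projection idea can be sketched in two dimensions, where the principal components of the 2x2 covariance matrix have a closed form. This is only an illustration on invented data; for real data you would use a linear algebra library.

```python
import math
from statistics import mean

def fit_minor_component(points):
    """Return (center, unit vector) of the lowest-variance principal direction."""
    mx = mean(x for x, _ in points)
    my = mean(y for _, y in points)
    n = len(points)
    a = sum((x - mx) ** 2 for x, _ in points) / n          # var(x)
    c = sum((y - my) ** 2 for _, y in points) / n          # var(y)
    b = sum((x - mx) * (y - my) for x, y in points) / n    # cov(x, y)
    # Smaller eigenvalue of [[a, b], [b, c]] and its eigenvector.
    lam = (a + c) / 2 - math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    if b != 0:
        vx, vy = lam - c, b
    else:
        vx, vy = (1.0, 0.0) if a <= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (mx, my), (vx / norm, vy / norm)

def minor_component_score(point, center, direction):
    """Absolute projection of the centered point onto the low-variance direction."""
    return abs((point[0] - center[0]) * direction[0]
               + (point[1] - center[1]) * direction[1])

# Hypothetical normal data lying close to the line y = x.
normal = [(1, 1.1), (2, 1.9), (3, 3.0), (4, 4.1), (5, 4.9)]
center, direction = fit_minor_component(normal)

on_trend = minor_component_score((6, 6), center, direction)
off_trend = minor_component_score((3, 6), center, direction)
print(on_trend, off_trend)
```

A point that follows the main trend has a small projection on the low-variance direction even when it is far from the center; a point that breaks the trend has a large one.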
Contextual Anomalies
Contextual attributes are key
• sequential: position in sequence is the context
◦time-series
◦event data (timestamped)
▪ inter-arrival time between events can be uneven
• spatial: location is the context
• graphs: the edges between data instances (the nodes) are the
context
• profiles (user defines context, like for credit card fraud)
Contextual Methods
• Convert to a point anomaly problem
• 1. identify a context for a data instance
• 2. compute anomaly score within the context with
a point anomaly method
• Use the structure of the data when breaking data
into contexts is hard (time-series and sequences)
• time-series
◦regression, RNNs
• sequences
◦Use events occurring before a particular time to predict the
event occurring at that time. 
◦If the prediction doesn't match the actual event, it's labeled rare.
◦Use Finite State Automata (FSA) or Hidden Markov Models
(HMMs) to compute conditional probabilities for events in the
sequence based on previous events.
◦Model event sequence as a Poisson process 
• graphs
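For sequences, the "predict the event from previous events" idea can be sketched with a first-order Markov chain, a simpler stand-in for the FSA and HMM approaches named above. The event log and the 0.1 threshold are invented for illustration.

```python
from collections import Counter, defaultdict

def fit_transitions(sequence):
    """Count how often each event follows each previous event."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(sequence, sequence[1:]):
        counts[prev][nxt] += 1
    return counts

def transition_probability(counts, prev, nxt):
    """Estimate P(next event | previous event) from the counts."""
    following = counts.get(prev, Counter())
    total = sum(following.values())
    return following[nxt] / total if total else 0.0

# Hypothetical event log.
events = ("login view view buy logout login view buy logout "
          "login view view logout").split()
counts = fit_transitions(events)

p_common = transition_probability(counts, "login", "view")
p_rare = transition_probability(counts, "login", "buy")
is_rare = p_rare < 0.1   # assumed threshold
print(p_common, p_rare, is_rare)
```

Here the contextual attribute is the previous event, and the behavioral attribute is the event that actually occurred.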
Collective Anomalies 
• Hardest to detect because the anomaly lies in the collective
behavior, not in any single instance.
• Relationship between data points is important
◦Sequential => find an anomalous subsequence
▪ lots of research here b/c lots of time-series and
event sequence data in the wild
◦Spatial => find an anomalous subregion
▪ image/video processing
◦Graph => find an anomalous subgraph
◦The task is to find an anomalous subset
Detecting Collective
Sequential Anomalies
Reduce to point anomaly problem:
• transform subsequences and then use a point anomaly method
• FSA, Markov Models, HMMs, CRFs for symbols
Neural Nets would be powerful here
• RNNs + LSTMs + Autoencoders: Could use a sequence to
sequence model on the subsequences and compute
reconstruction error
• For every example we’ve looked at that used FSA or HMMs,
you could use neural nets instead
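The reduction to a point anomaly problem can be sketched with sliding windows. Here each subsequence is transformed into one feature (its standard deviation, an assumed choice) and a simple z-score is applied to the features. The signal is invented: every value in the flat stretch is individually normal, but the stretch as a whole is not.

```python
from statistics import mean, pstdev

def window_features(series, width, step):
    """Transform each subsequence into (start index, standard deviation)."""
    starts = range(0, len(series) - width + 1, step)
    return [(s, pstdev(series[s:s + width])) for s in starts]

def anomalous_windows(series, width=4, step=4, k=2.0):
    """Start indices of windows whose feature is more than k std. dev. from the mean."""
    feats = window_features(series, width, step)
    values = [f for _, f in feats]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [s for s, f in feats if abs(f - mu) / sigma > k]

# Hypothetical oscillating signal with a flat stretch in the middle.
pattern = [0, 1, 0, -1]
series = pattern * 5 + [0] * 8 + pattern * 5
flagged = anomalous_windows(series)
print(flagged)
```

A sequence-to-sequence autoencoder would replace the hand-picked feature with a learned reconstruction error per window; the thresholding step stays the same.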
Detecting Collective Spatial
Anomalies
• Most work here has been on images
• Anomaly detection in videos would likely be a combination of
techniques for spatial and sequential anomalies (collective or
otherwise). 
◦Video = sequence of images + an audio stream
• Convolutional neural networks (CNNs) have been used for
anomaly detection in images
◦Fully Convolutional Neural Network for Fast Anomaly
Detection in Crowded Scenes (2016): https://arxiv.org/abs/1609.00866
Most important thing…
• Understand your problem before picking a method. 
• Just because a method is the most accurate doesn’t
automatically make it the best solution for your problem.

2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 

An Introduction to Anomaly Detection

  • 1. An Introduction to Anomaly Detection Ken Graham
  • 2. What we’ll cover •What is Anomaly Detection? •What’s an anomaly? •Detecting Anomalies •Methods and Applications
  • 3. What is Anomaly Detection? • Trying to find patterns in data that are different from the expected. • Some applications: credit card fraud, insurance fraud, image processing, intrusion detection (cybersecurity), text analysis, sensor networks, insider threats, industrial damage
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11. Detecting Anomalies So, how would we detect some of these? Let’s take a naive approach. 1. Define a “normal” region. 2. Observations not in the “normal” region are anomalies.
  • 12. Will this work? • Boundary is hard to define • Definitions change over time • Definitions are domain-dependent • Labeled training data is hard to find • Training data is often heavily imbalanced
  • 13. Types of Data • Collection of data instances • a data instance has a set of attributes • Attributes can be of different types • binary • categorical • continuous
  • 14. • The attributes help determine the detection method. • The relationship between data instances is important. • Most existing anomaly detection techniques don’t assume any particular relationship between the data instances. We have to identify relationships.
  • 15. Types of input data • Sequential • time-series, sequences of symbols • Spatial • each data instance is related to its neighbors • images, vehicular traffic • Graph • data instances are nodes in a graph or network
  • 16. Three Types of Anomalies • 😃 There are only three.  • 😔 No, that doesn’t make it any easier to detect them. • Point anomaly • Contextual anomaly • Collective anomaly
  • 17. Point Anomaly • Generally a single data instance.  • Anomalous compared to the entirety of the data • Most research focuses on point anomalies • Can occur in any dataset
  • 18. Contextual Anomaly • Anomalous in relation to a specific context • Context comes from how data is structured • Context has to be specified as a part of the problem formulation • Each data instance can be defined using two sets of attributes: • contextual: determines the context (e.g. lat/long or time) • behavioral: non-contextual characteristics of an instance
  • 19. • Anomalous behavior is determined by the behavioral attributes within a specific context • A data instance might be a contextual anomaly in a given context, but a data instance with identical behavioral attributes could be considered normal in a different context.
  • 20. • Contextual anomalies are generally found in time-series data. Example: • Avg monthly temp. of an area over last few years. • 35 degrees F in winter might be normal • 35 degrees F in summer in same place is anomalous
  • 21.
  • 22. • Another example: Credit card fraud • Contextual attribute: time of purchase.  • $100 average weekly shopping bill, except during the Christmas week, when it reaches $1000.  • A new purchase of $1000 in July would be considered a contextual anomaly, since it’s unusual for July.  • The same amount spent during Christmas week will be considered normal.
  • 23. Collective Anomaly • A group of data instances are anomalous • They need not be anomalies by themselves • Again, the relationship between the data matters • Point | Collective problem + context = Contextual problem
  • 24. Three Types of Anomaly Detection Methods • Supervised • Use labeled training data to build a predictive model • Imbalanced data (many normal, few anomalies) • Semi-Supervised • Only need normal data • Model learns how to classify normal data • Unsupervised (no labeled data)
  • 26. Credit Card Fraud Data used • user ID • amount spent • time between consecutive card usage Credit card companies have complete, labeled data and user profiles
  • 27. Kinds of anomalies • point anomalies in transaction records ◦ high payments ◦ items never before purchased by the user ◦ high rate of purchase • contextual anomalies ◦ User defines the context ▪ Each credit card user is profiled based on card usage history. ▪ Each new transaction is compared to the user profile and flagged if it doesn’t match ◦ Location defines the context ▪ Detects anomalies among transactions at a specific geographic location.
  • 28. Cellphone Fraud Data used  • Call data records (CDRs) • CDR = vector of features ◦continuous (e.g., CALL-DURATION) ◦discrete (e.g., CALLING-CITY).  Kinds of anomalies • point anomalies from aggregated CDR data ◦aggregated by time, user, or area ◦high volume of calls ◦calls made to unlikely destinations
  • 29. Insider Trading Data used • Option trading data • Stock trading data • News • Data is time-series or otherwise temporally sequenced.
  • 30. Medical • Patient records ◦ Electronic Health Records (EHRs) ◦demographics, medical history, medication and allergies, immunization status, laboratory test results, radiology images, vital signs, personal statistics like age and weight, and billing information ◦ Electrocardiograms (ECG) and Electroencephalograms (EEG) • Temporal and/or spatial data 
  • 31. Types of anomalies • point anomalies ◦ e.g., abnormal patient condition, instrumentation errors, recording errors • contextual ◦ Disease outbreaks can be contextual anomalies  (e.g. geo-temporal pattern of viral infections)  • collective
  • 32.
  • 33. • False negatives can cost $$$ and lives • A colleague (David Gilmore) said:  • "Precision saves money, recall saves lives."
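The trade-off in the quote above can be made concrete with a small sketch. It assumes labels are encoded as 0 (normal) and 1 (anomaly); the function name `precision_recall` and the toy label lists are illustrative, not from the slides.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for the anomaly class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # anomalies caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms (cost money)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed anomalies (cost lives)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ground truth and detector output for ten observations.
y_true = [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
```

Here one false alarm lowers precision and one missed anomaly lowers recall; which of the two matters more depends on the application.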
  • 35. Classication • Train a model from labeled data (supervised) • Use the model to classify other data • Many different ways to do this ◦SVMs, PGMs, Rules ◦Neural nets have shown much promise ▪ LSTMs learn features across a sequence ▪ Autoencoders reconstruct the data, reconstruction error tells you if data is anomalous
  • 36. Recurrent Neural Nets and LSTMs Now we’ll look at a method or two for time-series data. • Method needs to learn patterns present in the sequence • Sequences can have patterns of unknown length • Recurrent neural networks (RNNs)[1][2] let you address sequences of data
  • 37. • Detect deviations from normalcy • Steps ◦Train the NN to predict several time steps into the future  ◦Each point in the sequence has several corresponding predicted values made at different points in the past, resulting in multiple error values.  ◦Compute error distribution • More generally, to detect anomalies in a time series ◦Anomalous if prediction error is larger than expected ◦Can pick an error threshold, e.g. 2 std. dev. from the mean
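The thresholding step above can be sketched without a neural net. This minimal example swaps in a deliberately naive forecaster (the mean of the previous `window` values) in place of the RNN, and fits the error distribution on the same series it scores; both are simplifying assumptions, and the names `forecast_errors` and `flag_anomalies` are illustrative.

```python
import statistics


def forecast_errors(series, window=3):
    """One-step-ahead errors from a moving-average forecaster (stand-in for an RNN)."""
    errors = []
    for i in range(window, len(series)):
        prediction = sum(series[i - window:i]) / window
        errors.append(series[i] - prediction)
    return errors


def flag_anomalies(series, window=3, n_std=2.0):
    """Return indices whose prediction error is more than n_std std. dev. from the mean error."""
    errors = forecast_errors(series, window)
    mu = statistics.fmean(errors)
    sigma = statistics.pstdev(errors)
    if sigma == 0:
        return []
    return [i + window for i, e in enumerate(errors) if abs(e - mu) > n_std * sigma]


# Toy series: a regular alternation with one spike at index 9.
series = [10, 11, 10, 11, 10, 11, 10, 11, 10, 30, 10, 11, 10, 11, 10, 11]
flagged = flag_anomalies(series)
```

With a learned model the structure stays the same: only `forecast_errors` changes, and the error distribution would normally be estimated on held-out normal data.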
  • 39. • Train the autoencoder. • If the data is sequential, you can incorporate RNNs or LSTMs. • Use the model to reconstruct the input. • If the reconstruction error is above some threshold, label it as an anomaly
  • 40. Nearest-Neighbor Methods Assumption • Normal data are close together, while anomalies are far away Two Methods 1. Anomaly score is distance to kth nearest neighbor. 2. Anomaly score is the density of the neighborhood of each point • Distance metric affects computational complexity • Easy to adapt to different problem domains. Just define the distance metric
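A minimal sketch of the first method (distance to the kth nearest neighbor), assuming Euclidean distance and a brute-force search over all pairs. The function name `knn_scores` and the sample points are illustrative.

```python
import math


def knn_scores(points, k=2):
    """Anomaly score of each point = distance to its kth nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Brute force: distances from p to every other point, sorted ascending.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Four points in a tight cluster and one far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_scores(points, k=2)
most_anomalous = max(range(len(points)), key=scores.__getitem__)
```

Adapting this to another domain means replacing `math.dist` with a domain-specific distance; the quadratic pairwise loop is where the distance metric's cost shows up.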
  • 41. Statistical Methods • Assumption • Normal data lies in high probability regions, anomalies in low probability regions • Parametric and non-parametric methods
  • 42. Parametric • Assumes normal data is distributed according to a parametric distribution • Anomaly score is inverse of the PDF  • Or, use a hypothesis test. Anomaly score can be test statistic
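As a sketch of the parametric approach, the example below assumes the normal data is Gaussian, fits the mean and standard deviation by maximum likelihood, and scores a value by the inverse of the fitted PDF. The function names and training values are illustrative.

```python
from statistics import NormalDist, fmean, pstdev


def fit_gaussian(train):
    """MLE fit of a univariate Gaussian (population std. dev., not sample)."""
    return NormalDist(mu=fmean(train), sigma=pstdev(train))


def anomaly_score(model, x):
    """Inverse of the PDF: low-probability regions get high scores.

    For values very far in the tail the PDF can underflow to 0, so a
    negative log-density is a more robust choice in practice.
    """
    return 1.0 / model.pdf(x)


# Hypothetical sensor readings assumed to be normal.
train = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
model = fit_gaussian(train)
score_typical = anomaly_score(model, 10.0)
score_unusual = anomaly_score(model, 12.0)
```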
  • 43. Examples: • Gaussian models => maximum likelihood estimation (MLE), Grubbs' test and variants • Regression models => ARIMA, ARMA • mixtures of models ◦ Assume each data point has prob. p of being an anomaly ◦ N = PDF of normal data ◦ A = PDF of anomalies (assumed to be uniform) ◦ D = PDF of all the data = pA + (1-p)N ◦ Start with all points in N ◦ Anomaly score comes from how much the distributions change if you move a point to A.
  • 44. Non-parametric • Histogram models ◦ Does the test instance fit into an existing bin? ◦ Or, determine the score from the bin in which it lands • Kernel methods estimate the data PDF and are similar to parametric methods
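A minimal histogram sketch, assuming fixed-width bins and a score of one minus the relative frequency of the bin the test instance lands in (an empty bin scores 1.0). The bin width, names, and data are illustrative choices.

```python
from collections import Counter


def build_histogram(train, bin_width):
    """Count training values per fixed-width bin, keyed by bin index."""
    return Counter(int(x // bin_width) for x in train)


def histogram_score(hist, x, bin_width):
    """Score = 1 - relative frequency of the bin that x falls into."""
    total = sum(hist.values())
    freq = hist.get(int(x // bin_width), 0) / total
    return 1.0 - freq


train = [1.2, 1.8, 2.1, 2.4, 2.6, 2.9, 3.1, 3.3, 7.5]
hist = build_histogram(train, bin_width=1.0)
score_dense = histogram_score(hist, 2.5, 1.0)   # lands in the fullest bin
score_sparse = histogram_score(hist, 7.2, 1.0)  # lands in a bin with one value
score_empty = histogram_score(hist, 5.0, 1.0)   # lands in no existing bin
```

The result is sensitive to the bin width: bins that are too narrow leave normal values in empty bins, while bins that are too wide hide anomalies inside populated ones.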
  • 45. Spectral Methods Assumption • “Data can be embedded into a lower dimensional subspace in which normal instances and anomalies appear significantly different.” - Anomaly Detection: A Survey Main idea: Find a subspace where the anomalies are easy to see and project data onto it.
  • 46. Methods • Unsupervised or semi-supervised • PCA ◦ Project data along low variance principal components. Anomaly projections will be high ◦ In graphs, PCA on a graph’s adjacency matrix at different points in time, differences in principal components determine anomaly status • Errors in Compact Matrix Decomposition (CMD) of a graph’s adjacency matrix determine an anomalous graph • PCA can be expensive
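The PCA idea can be sketched in two dimensions, where the principal axes of the covariance matrix have a closed form. This example assumes 2-D data and a semi-supervised setting (fit on normal points only); the score is the absolute projection onto the low-variance component. All names and data are illustrative.

```python
import math


def fit_minor_component(points):
    """Return the data mean and the unit vector of the low-variance principal axis (2-D only)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Orientation of the major axis of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    # The minor (low-variance) axis is perpendicular to the major axis.
    minor = (-math.sin(theta), math.cos(theta))
    return (mx, my), minor


def pca_score(point, mean, minor):
    """Absolute projection of the centered point onto the low-variance axis."""
    return abs((point[0] - mean[0]) * minor[0] + (point[1] - mean[1]) * minor[1])


# Normal data lies close to the line y = x.
train = [(1, 1.1), (2, 1.9), (3, 3.05), (4, 3.95), (5, 5.1), (6, 5.9)]
mean, minor = fit_minor_component(train)
score_on_line = pca_score((2.0, 2.0), mean, minor)
score_off_line = pca_score((3.5, 6.0), mean, minor)
```

In higher dimensions the same score comes from an eigendecomposition of the covariance matrix, which is where the cost noted on the slide arises.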
  • 47. Contextual Anomalies Contextual attributes are key • sequential: position in sequence is the context ◦ time-series ◦ event data (timestamped) ▪ inter-arrival time between events can be uneven • spatial: location is the context • graphs: the edges between data instances (the nodes) are the context • profiles (user defines context, like for credit card fraud)
  • 48. Contextual Methods • Convert to a point anomaly problem • 1. identify a context for a data instance • 2. compute anomaly score within the context with a point anomaly method • Use the structure of the data when breaking data into contexts is hard (time-series and sequences)
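The two steps above can be sketched with the earlier credit card example: the week type is the contextual attribute, the amount is the behavioral attribute, and a z-score within each context serves as the point anomaly method. The context labels, amounts, and function names are illustrative assumptions.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def fit_contexts(history):
    """Step 1: group behavioral values by context and summarize each group."""
    groups = defaultdict(list)
    for context, value in history:
        groups[context].append(value)
    return {c: (fmean(v), pstdev(v)) for c, v in groups.items()}


def contextual_score(stats, context, value):
    """Step 2: z-score of the value within its own context."""
    mu, sigma = stats[context]
    if sigma == 0:
        return 0.0 if value == mu else float("inf")
    return abs(value - mu) / sigma


# Hypothetical weekly spending, labeled by context.
history = [
    ("ordinary", 95), ("ordinary", 105), ("ordinary", 100),
    ("ordinary", 110), ("ordinary", 90),
    ("christmas", 950), ("christmas", 1050), ("christmas", 1000),
]
stats = fit_contexts(history)
score_july = contextual_score(stats, "ordinary", 1000)
score_christmas = contextual_score(stats, "christmas", 1000)
```

The same $1000 purchase scores far above a typical cutoff of 3 in an ordinary week and scores 0 in Christmas week.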
  • 49. • time-series ◦regression, RNNs • sequences ◦Use events occurring before a particular time to predict the event occurring at that time.  ◦If the prediction doesn't match the actual event, it's labeled rare. ◦Finite State Automata (FSA) and Hidden Markov Models (HMMs) to compute conditional probabilities for events in the sequence based on previous events.  ◦Model event sequence as a Poisson process  • graphs
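For symbol sequences, the conditional-probability idea can be sketched with a first-order Markov chain, a simpler stand-in for the FSA and HMM approaches named above. It estimates P(next | previous) from a training sequence and labels an event rare when that probability falls below a cutoff. The cutoff `min_prob` and all names are illustrative assumptions.

```python
from collections import Counter, defaultdict


def fit_transitions(sequence):
    """Count how often each event follows each previous event."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(sequence, sequence[1:]):
        counts[prev][nxt] += 1
    return counts


def transition_prob(counts, prev, nxt):
    """Estimated P(nxt | prev); 0.0 if prev was never seen in training."""
    following = counts.get(prev, Counter())
    total = sum(following.values())
    return following[nxt] / total if total else 0.0


def rare_events(counts, sequence, min_prob=0.1):
    """Indices of events that are unlikely given the event just before them."""
    return [
        i for i in range(1, len(sequence))
        if transition_prob(counts, sequence[i - 1], sequence[i]) < min_prob
    ]


train = list("abcabcabcabcabc")
counts = fit_transitions(train)
rare = rare_events(counts, list("abcabxabc"))
```

Note that the event after an unseen symbol is also flagged, because the model has no evidence for any transition out of that symbol.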
  • 50. Collective Anomalies • Hardest to detect because theirs is collective behavior. • Relationship between data points is important ◦ Sequential => find an anomalous subsequence ▪ lots of research here b/c lots of time-series and event sequence data in the wild ◦ Spatial => find an anomalous subregion ▪ image/video processing ◦ Graph => find an anomalous subgraph ◦ The task is to find an anomalous subset
  • 51. Detecting Collective Sequential Anomalies Reduce to point anomaly problem: • transform subsequences and then use a point anomaly method • FSA, Markov Models, HMMs, CRFs for symbols Neural Nets would be powerful here • RNNs + LSTMs + Autoencoders: Could use a sequence to sequence model on the subsequences and compute reconstruction error • For every example we’ve looked at that used FSA or HMMs, you could use neural nets instead
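The reduction to a point anomaly problem can be sketched as follows: slide a window over the series, turn each subsequence into a single feature, and apply a point method to the features. This example assumes the window's range (max minus min) as the feature and a z-score as the point method; both are illustrative choices, as are the names and the toy signal.

```python
from statistics import fmean, pstdev


def window_ranges(series, width):
    """Transform each sliding subsequence into one feature: its value range."""
    return [
        max(series[s:s + width]) - min(series[s:s + width])
        for s in range(len(series) - width + 1)
    ]


def anomalous_windows(series, width=4, n_std=2.0):
    """Start indices of windows whose feature is a point anomaly among all windows."""
    features = window_ranges(series, width)
    mu, sigma = fmean(features), pstdev(features)
    if sigma == 0:
        return []
    return [s for s, f in enumerate(features) if abs(f - mu) > n_std * sigma]


# A periodic signal with a flat stretch: every 0 is normal on its own,
# but eight of them in a row is a collective anomaly.
signal = [0, 1, 0, -1] * 5 + [0] * 8 + [0, 1, 0, -1] * 5
flagged_starts = anomalous_windows(signal)
```

A sequence-to-sequence autoencoder fits the same template: the feature becomes the reconstruction error of each subsequence.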
  • 52. Detecting Collective Spatial Anomalies • Most work here has been on images • Anomaly detection in videos would likely be a combination of techniques for spatial and sequential anomalies (collective or otherwise). ◦ Video = sequence of images + an audio stream • Convolutional neural networks (CNNs) have been used for anomaly detection in images ◦ Fully Convolutional Neural Network for Fast Anomaly Detection in Crowded Scenes (2016): https://arxiv.org/abs/1609.00866
  • 53. Most important thing… • Understand your problem before picking a method.  • Just because a method is the most accurate doesn’t automatically make it the best solution for your problem.