Anomaly Detection

© MapR Technologies, confidential

Agenda
•  What is anomaly detection?
•  Some examples
•  Some generalization
•  More interesting examples
•  Sample implementation methods

Who I am
•  Ted Dunning, Chief Application Architect, MapR
tdunning@mapr.com
tdunning@apache.org
@ted_dunning

•  Committer, mentor, champion, PMC member on
several Apache projects
•  Mahout, Drill, ZooKeeper, and others

Who we are
•  MapR makes the technology-leading distribution
that includes Hadoop
•  MapR integrates real-time data semantics
directly into a system that also runs Hadoop
programs seamlessly
•  The biggest and best choose MapR
–  Google, Amazon
–  Largest credit card, retailer, health insurance, telco
–  Ping me for info

What is Anomaly Detection?
•  What just happened that shouldn’t?
–  but I don’t know what failure looks like (yet)

•  Find the problem before other people see it
–  especially customers and CEOs

•  But don’t wake me up if it isn’t really broken

Looks pretty anomalous to me

Will the real anomaly please stand up?

What Are We Really Doing?
•  We want action when something breaks
(dies/falls over/otherwise gets in trouble)

•  But action is expensive
•  So we don’t want false alarms
•  And we don’t want false negatives
•  We need to trade off costs
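The cost trade-off can be made concrete with a minimal sketch. It assumes we have anomaly scores for some past events that were later labeled as real failures or not, plus a per-event cost for a false alarm and for a missed failure. The names `expected_cost`, `pick_threshold`, `cost_false_alarm` and `cost_miss` are illustrative and not from the talk.

```python
def expected_cost(scores, is_failure, threshold, cost_false_alarm, cost_miss):
    """Total cost of alarming whenever score > threshold."""
    cost = 0.0
    for score, failed in zip(scores, is_failure):
        alarmed = score > threshold
        if alarmed and not failed:
            cost += cost_false_alarm   # woke someone up for nothing
        elif failed and not alarmed:
            cost += cost_miss          # a real failure went unnoticed
    return cost


def pick_threshold(scores, is_failure, cost_false_alarm, cost_miss):
    """Try each observed score as a threshold and keep the cheapest one."""
    candidates = sorted(set(scores))
    return min(
        candidates,
        key=lambda t: expected_cost(scores, is_failure, t, cost_false_alarm, cost_miss),
    )


# Hypothetical history: six scores, two of which were real failures.
scores = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
is_failure = [False, False, False, True, False, True]

# Misses are expensive: the threshold drops so both failures are caught.
cheap_alarms = pick_threshold(scores, is_failure, cost_false_alarm=1.0, cost_miss=10.0)
# False alarms are expensive: the threshold rises and one failure is accepted.
costly_alarms = pick_threshold(scores, is_failure, cost_false_alarm=10.0, cost_miss=1.0)
```

The same data gives a different threshold depending on which kind of error costs more, which is the point of the slide.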

A Second Look

99.9%-ile

With Spikes

99.9%-ile including spikes

How Hard Can it Be?

[Diagram: each value x goes to an online summarizer, which supplies the 99.9%-ile as threshold t; if x > t, raise an alarm]
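The pipeline on this slide (summarize online, take the 99.9%-ile as threshold t, alarm when x > t) can be sketched as follows. This is a rough illustration, not the talk's implementation: `ExactSummarizer` is a hypothetical stand-in that keeps every value in a sorted list, where a real system would use a bounded-memory quantile sketch. The `warmup` parameter is also an assumption.

```python
import bisect
import random


class ExactSummarizer:
    """Stand-in for an online summarizer: keeps every value, sorted.

    A real deployment would use a bounded-memory quantile sketch instead.
    """

    def __init__(self):
        self.values = []

    def add(self, x):
        bisect.insort(self.values, x)

    def quantile(self, q):
        # Nearest-rank quantile over everything seen so far.
        index = min(len(self.values) - 1, int(q * len(self.values)))
        return self.values[index]


def detect(stream, q=0.999, warmup=1000):
    """Yield (index, x) whenever x exceeds the running q-quantile."""
    summary = ExactSummarizer()
    for i, x in enumerate(stream):
        if i >= warmup and x > summary.quantile(q):
            yield i, x
        summary.add(x)


random.seed(0)
signal = [random.gauss(0.0, 1.0) for _ in range(5000)]
signal[4000] = 25.0  # inject one obvious spike
alarms = list(detect(signal))
```

Because the threshold is a quantile of the data itself, a handful of ordinary extreme values will also raise alarms; that is the false-alarm budget implied by choosing 99.9%.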

On-line Percentile Estimates
•  Apache Mahout has an on-line percentile estimator
–  very high accuracy for extreme tails
–  new in version 0.9!

•  What’s the big deal with anomaly detection?
•  This looks like a solved problem

Spot the Anomaly

Anomaly?

Maybe not!

Where’s Waldo?

This is the real anomaly

Normal Isn’t Just Normal
•  What we want is a model of what is normal
•  What doesn’t fit the model is the anomaly
•  For simple signals, the model can be simple …

x ~ N(0, ε)
•  The real world is rarely so accommodating

We Do Windows

Windows on the World
•  The set of windowed signals is a nice model of
our original signal
•  Clustering can find the prototypes
–  Fancier techniques available using sparse coding

•  The result is a dictionary of shapes
•  New signals can be encoded by shifting, scaling
and adding shapes from the dictionary
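The windowing and clustering steps can be sketched in a few lines. This is a simplified illustration under stated assumptions: it uses plain k-means with evenly spaced starting centroids (chosen only to keep the sketch deterministic), and it encodes a new window by its single nearest dictionary shape rather than by shifting, scaling and adding shapes. All function names and the toy sine signal are hypothetical.

```python
import math


def windows(signal, width, step):
    """Cut the signal into fixed-width windows taken every `step` samples."""
    return [signal[i:i + width] for i in range(0, len(signal) - width + 1, step)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Plain Lloyd's k-means; returns k centroids (the shape dictionary)."""
    stride = len(points) // k
    centroids = [list(points[j * stride]) for j in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            groups[nearest].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster went empty
                centroids[j] = [sum(column) / len(group) for column in zip(*group)]
    return centroids


def reconstruction_error(window, dictionary):
    """Distance from a window to its best-matching dictionary shape."""
    return math.sqrt(min(squared_distance(window, shape) for shape in dictionary))


# Toy periodic signal standing in for a real recording.
signal = [math.sin(2 * math.pi * t / 32) for t in range(2048)]
dictionary = kmeans(windows(signal, width=32, step=4), k=8)

normal = signal[100:132]
glitch = list(normal)
glitch[10:20] = [3.0] * 10  # a distortion the dictionary has never seen
normal_error = reconstruction_error(normal, dictionary)
glitch_error = reconstruction_error(glitch, dictionary)
```

A window that looks like the training data reconstructs almost perfectly, while the distorted window leaves a large residual.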

Most Common Shapes (for EKG)

Reconstructed signal
[Figure panels: original signal, reconstructed signal, reconstruction error]

An Anomaly

The original technique for finding 1-d anomalies
works against the reconstruction error

Close-up of anomaly

Not what you want your heart to do.
And not what the model expects it to do.

A Different Kind of Anomaly

Model Delta Anomaly Detection

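[Diagram: the input and the Model output meet at a summing node (+) to give δ; an online summarizer supplies the 99.9%-ile as threshold t; if δ > t, raise an alarm]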
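A minimal sketch of the model-delta idea: subtract what the model expects from what was observed, then threshold the difference. The sine `model`, the noise level, and the batch-fitted threshold (learned on a training prefix rather than online) are all assumptions made for illustration.

```python
import math
import random


def model(t):
    """Assumed known model of normal behavior: a clean periodic signal."""
    return math.sin(2 * math.pi * t / 50)


def quantile(values, q):
    """Nearest-rank quantile of a finite sample."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]


random.seed(1)
observed = [model(t) + random.gauss(0.0, 0.05) for t in range(3000)]
observed[2512] = 0.0  # an ordinary-looking value, but the model expects about 1.0

# Delta between observation and model, thresholded at the 99.9%-ile
# of the deltas seen during a training prefix.
deltas = [abs(x - model(t)) for t, x in enumerate(observed)]
threshold = quantile(deltas[:2000], 0.999)
alarms = [t for t in range(2000, 3000) if deltas[t] > threshold]
```

The injected value sits well inside the signal's normal range, so a threshold on the raw value would miss it; only the delta against the model exposes it.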

The Real Inside Scoop
•  The model-delta anomaly detector is really just a
mixture distribution
–  the model we know about already
–  and a normally distributed error

•  The output (delta) is (roughly) the log probability
of the mixture distribution
•  Thinking about probability distributions is good
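One way to see the link between delta and log probability is a short worked derivation. It assumes the error is Gaussian with standard deviation σ, writes $m_t$ for the model's prediction, and ignores the mixture weights; these symbols are ours, not the talk's.

$$x_t = m_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2), \qquad \delta_t = x_t - m_t$$

$$-\log p(x_t) = \frac{\delta_t^2}{2\sigma^2} + \log\left(\sigma\sqrt{2\pi}\right)$$

The second term is constant, so under these assumptions ranking points by $|\delta_t|$ is the same as ranking them by $-\log p(x_t)$.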

Example: Event Stream (timing)
•  Events of various types arrive at irregular
intervals
–  we can assume a Poisson distribution

•  The key question is whether frequency has
changed relative to expected values
•  We want an alert as soon as possible

Poisson Distribution
•  Time between events is exponentially distributed

$$\Delta t \sim \lambda e^{-\lambda t}$$

•  This means that long delays are exponentially
rare

$$P(\Delta t > T) = e^{-\lambda T}$$

$$-\log P(\Delta t > T) = \lambda T$$

•  If we know λ we can select a good threshold
–  or we can pick a threshold empirically
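When λ is known, the threshold follows directly from the tail formula: solve e^(-λT) = p for T. The sketch below assumes a known rate and a target tail probability; `gap_threshold`, `surprise`, and the value of `rate` are illustrative.

```python
import math
import random


def gap_threshold(rate, tail_probability):
    """Gap T such that P(gap > T) = tail_probability for Poisson arrivals."""
    return -math.log(tail_probability) / rate


def surprise(gap, rate):
    """Anomaly score -log P(gap > observed gap) = rate * gap."""
    return rate * gap


rate = 2.0  # assumed known: events per second
limit = gap_threshold(rate, tail_probability=0.001)

# Sanity check against simulated exponential gaps at the same rate.
random.seed(2)
gaps = [random.expovariate(rate) for _ in range(100_000)]
fraction_over = sum(g > limit for g in gaps) / len(gaps)
```

With these numbers, a silence longer than about 3.45 seconds should occur for roughly one gap in a thousand under normal conditions.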

Converting Event Times to Anomaly
[Figure labels: 99.9%-ile, 99.99%-ile]

Example: Event Stream (sequence)
•  The scenario:
–  Web visitors are being subjected to phishing attacks
–  The hook directs the visitors to a mocked login
–  When they enter their credentials, the attacker uses
them to log in
–  We assume CAPTCHAs or similar are part of the
authentication, so the attacker has to show the images
on the login page to the visitor

•  The problem:
–  We don’t really know how the attack works

Normal Event Flow
[Diagram: event nodes Image-1, Image-2, login, Image-3, Image-4]

Phishing Flow
[Diagram labels: Real login, Fake login, two login events, and repeated Image1 requests]

Key Observations
•  Regardless of exact details, there are patterns
•  Event stream per user shows these patterns
•  Phishing will have different patterns at a much
lower rate
•  Measuring statistical surprise gives a good
anomaly (fraud or malfunction) indicator
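Statistical surprise over a per-user event stream can be sketched with a first-order Markov model: count which event follows which in normal sessions, then score a new session by its average -log p per transition. The event names, the add-one style smoothing, and the session data below are all hypothetical; the talk does not specify a particular model here.

```python
import math
from collections import Counter, defaultdict


def train(sessions, smoothing=1.0):
    """Count event-to-event transitions; returns a surprise function."""
    counts = defaultdict(Counter)
    vocabulary = set()
    for session in sessions:
        previous = "<start>"
        for event in session:
            counts[previous][event] += 1
            vocabulary.add(event)
            previous = event

    def surprise(session):
        """Average -log p per transition under the smoothed model."""
        total, previous = 0.0, "<start>"
        for event in session:
            seen = counts[previous]
            # One extra slot in the denominator is reserved for unseen events.
            denominator = sum(seen.values()) + smoothing * (len(vocabulary) + 1)
            total -= math.log((seen[event] + smoothing) / denominator)
            previous = event
        return total / max(1, len(session))

    return surprise


# Hypothetical sessions: the normal flow loads four distinct images, then logs in.
normal = ["image-1", "image-2", "image-3", "image-4", "login"]
phish = ["image-1", "image-1", "image-1", "image-1", "login"]

surprise = train([normal] * 200)
normal_score = surprise(normal)
phish_score = surprise(phish)
```

The exact attack does not need to be known in advance: any session whose transitions are rare under the normal model gets a high score.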

Recap (out of order)
•  Anomaly detection is best done with a probability
model
•  Deep learning is a neat way to build this model
–  converting to symbolic dynamics simplifies life

•  -log p is a good way to convert to an anomaly
measure
•  Adaptive quantile estimation works for auto-setting thresholds

Recap
•  Different systems require different models
•  Continuous time-series
–  sparse coding or deep learning to build signal model

•  Events in time
–  rate model based on variable-rate Poisson
–  segregated rate model

•  Events with labels
–  language modeling
–  hidden Markov models

How Do I Build Such a System?
•  The key is to combine real-time and long-time
–  real-time evaluates data stream against model
–  long-time is how we build the model

•  Extended Lambda architecture is my favorite
•  See my other talks on slideshare.net for info
•  Ping me directly

Hadoop is Not Very Real-time
[Diagram: timeline t ending at "now". Labels: fully processed data, latest full period, unprocessed data; "Hadoop job takes this long for this data"]

Real-time and Long-time together
[Diagram: timeline t ending at "now" under a blended view. Labels: "Hadoop works great back here", "Storm works here"]
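The blended view can be illustrated with a toy query that adds a batch total to whatever the real-time path has seen since the batch cutoff. This is only a sketch of the idea: `blended_count`, the timestamps, and the in-memory data are hypothetical, with a dict standing in for batch output and a list standing in for the live stream.

```python
def blended_count(batch_counts, realtime_events, batch_cutoff, key):
    """Batch total up to the cutoff plus live events seen after it."""
    recent = sum(
        1 for when, name in realtime_events
        if name == key and when > batch_cutoff
    )
    return batch_counts.get(key, 0) + recent


# Batch layer: totals computed over all data up to the cutoff time.
batch_counts = {"login": 1000}
batch_cutoff = 3600

# Real-time layer: (timestamp, event) pairs, some already covered by the batch.
realtime_events = [(3500, "login"), (3650, "login"), (3700, "image"), (3710, "login")]

total_logins = blended_count(batch_counts, realtime_events, batch_cutoff, "login")
```

Events at or before the cutoff are ignored on the real-time side so they are not counted twice.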

Strata 2014 Anomaly Detection

  • 1. ® Anomaly Detection © MapR Technologies, confidential ®
  • 2. Agenda •  •  •  •  •  What is anomaly detection? Some examples Some generalization More interesting examples Sample implementation methods © MapR Technologies, confidential ®
  • 3. Who I am •  Ted Dunning, Chief Application Architect, MapR tdunning@mapr.com tdunning@apache.org @ted_dunning •  Committer, mentor, champion, PMC member on several Apache projects •  Mahout, Drill, Zookeeper others © MapR Technologies, confidential ®
  • 4. Who we are •  MapR makes the technology leading distribution including Hadoop •  MapR integrates real-time data semantics directly into a system that also runs Hadoop programs seamlessly •  The biggest and best choose MapR –  Google, Amazon –  Largest credit card, retailer, health insurance, telco –  Ping me for info © MapR Technologies, confidential ®
  • 5. What is Anomaly Detection? •  What just happened that shouldn’t? –  but I don’t know what failure looks like (yet) •  Find the problem before other people see it –  especially customers and CEO’s •  But don’t wake me up if it isn’t really broken © MapR Technologies, confidential ®
  • 6. © MapR Technologies, confidential ®
  • 7. Looks pretty anomalous to me © MapR Technologies, confidential ®
  • 8. Will the real anomaly please stand up? © MapR Technologies, confidential ®
  • 9. What Are We Really Doing •  We want action when something breaks (dies/falls over/otherwise gets in trouble) •  But action is expensive •  So we don’t want false alarms •  And we don’t want false negatives •  We need to trade off costs © MapR Technologies, confidential ®
  • 10. A Second Look © MapR Technologies, confidential ®
  • 11. A Second Look 99.9%-ile © MapR Technologies, confidential ®
  • 12. With Spikes © MapR Technologies, confidential ®
  • 13. With Spikes 99.9%-ile including spikes © MapR Technologies, confidential ®
  • 14. How Hard Can it Be? x x>t? Alarm ! t Online Summarizer 99.9%-ile © MapR Technologies, confidential ®
  • 15. On-line Percentile Estimates •  Apache Mahout has on-line percentile estimator –  very high accuracy for extreme tails –  new in version 0.9 !! •  What’s the big deal with anomaly detection? •  This looks like a solved problem © MapR Technologies, confidential ®
  • 16. Spot the Anomaly Anomaly? © MapR Technologies, confidential ®
  • 17. Maybe not! © MapR Technologies, confidential ®
  • 18. Where’s Waldo? This is the real anomaly © MapR Technologies, confidential ®
  • 19. Normal Isn’t Just Normal •  What we want is a model of what is normal •  What doesn’t fit the model is the anomaly •  For simple signals, the model can be simple … x ~ N(0, ε ) •  The real world is rarely so accommodating © MapR Technologies, confidential ®
  • 20-34. We Do Windows
  • 35. Windows on the World •  The set of windowed signals is a nice model of our original signal •  Clustering can find the prototypes –  Fancier techniques are available using sparse coding •  The result is a dictionary of shapes •  New signals can be encoded by shifting, scaling and adding shapes from the dictionary
  • 36. Most Common Shapes (for EKG)
  • 38. An Anomaly •  The original technique for finding a 1-d anomaly works when applied to the reconstruction error
  • 39. Close-up of anomaly •  Not what you want your heart to do. And not what the model expects it to do.
  • 40. A Different Kind of Anomaly
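The windowed-dictionary idea from slides 35 to 39 can be sketched as follows. This is a simplified illustration under stated assumptions: plain k-means stands in for the clustering step, each window is encoded by its single nearest prototype (no shifting, scaling or adding of shapes), and all function names are hypothetical.

```python
def windows(signal, width, step):
    """Cut a signal into (possibly overlapping) windows of fixed width."""
    return [signal[i:i + width] for i in range(0, len(signal) - width + 1, step)]


def dist2(a, b):
    """Squared Euclidean distance between two equal-length windows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means with a deterministic, evenly spaced initialisation."""
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist2(p, centroids[c]))
            groups[nearest].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def reconstruction_errors(signal, dictionary, width):
    """Error of the best single-prototype match for each non-overlapping window."""
    return [min(dist2(w, c) for c in dictionary)
            for w in windows(signal, width, width)]
```

Windows that the dictionary reproduces well give errors near zero, and a window the model has never seen gives a large error. That error series is one-dimensional, so the percentile threshold from the earlier slides can be applied to it directly.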
  • 41. Model Delta Anomaly Detection [Diagram: the model's prediction is compared with the input to give the delta δ; an online summarizer supplies the 99.9%-ile of δ as the threshold t; the test δ > t raises an alarm]
  • 42. The Real Inside Scoop •  The model-delta anomaly detector is really just a mixture distribution –  the model we know about already –  and a normally distributed error •  The output (delta) is (roughly) the log probability of the mixture distribution •  Thinking about probability distributions is good
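The link between the delta and a log probability can be made concrete under one simple assumption, which is not spelled out on the slide: the observation is the model's prediction plus zero-mean Gaussian noise with fixed variance. The symbols $m_t$, $e_t$ and $\sigma$ are introduced here for illustration only.

```latex
x_t = m_t + e_t, \qquad e_t \sim N(0, \sigma^2), \qquad \delta_t = x_t - m_t

-\log p(x_t) = \frac{\delta_t^2}{2\sigma^2} + \frac{1}{2}\log\left(2\pi\sigma^2\right)
```

Under that assumption the negative log probability grows monotonically with $|\delta_t|$, so thresholding the size of the delta is equivalent to thresholding $-\log p$.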
  • 43. Example: Event Stream (timing) •  Events of various types arrive at irregular intervals –  we can assume a Poisson distribution •  The key question is whether the frequency has changed relative to expected values •  We want an alert as soon as possible
  • 44. Poisson Distribution •  Time between events is exponentially distributed: $\Delta t \sim \lambda e^{-\lambda t}$ •  This means that long delays are exponentially rare: $P(\Delta t > T) = e^{-\lambda T}$, so $-\log P(\Delta t > T) = \lambda T$ •  If we know λ we can select a good threshold –  or we can pick a threshold empirically
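The formulas on slide 44 translate almost directly into code. The sketch below assumes a known, constant rate λ (`rate`); the function names and the `tail_probability` parameter are made up for the example.

```python
import math


def surprise(gap, rate):
    """-log P(dt > gap) = rate * gap for exponential inter-event times."""
    return rate * gap


def threshold_for(tail_probability, rate):
    """Gap length T such that P(dt > T) equals tail_probability."""
    return -math.log(tail_probability) / rate


def gap_alarms(timestamps, rate, tail_probability=0.001):
    """Return the timestamps whose preceding gap is improbably long."""
    limit = threshold_for(tail_probability, rate)
    return [later for earlier, later in zip(timestamps, timestamps[1:])
            if later - earlier > limit]
```

With `rate=2.0` events per unit time and a tail probability of 0.001, the threshold is about 3.45 time units, so only gaps longer than that raise an alarm. Note that this version only fires when the next event finally arrives; alerting as soon as possible would mean also checking the time elapsed since the last event against the same threshold.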
  • 45. Converting Event Times to Anomaly (plot labels: 99.9%-ile, 99.99%-ile)
  • 46. Example: Event Stream (sequence) •  The scenario: –  Web visitors are being subjected to phishing attacks –  The hook directs the visitors to a mocked login –  When they enter their credentials, the attacker uses them to log in –  We assume CAPTCHAs or similar are part of the authentication, so the attacker has to show the images on the login page to the visitor •  The problem: –  We don’t really know how the attack works
  • 49. Key Observations •  Regardless of exact details, there are patterns •  The event stream per user shows these patterns •  Phishing will have different patterns at a much lower rate •  Measuring statistical surprise gives a good anomaly (fraud or malfunction) indicator
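One simple way to measure statistical surprise on a per-user event stream is a first-order Markov model scored with -log p. This is a sketch under assumptions, not the method used in the talk: the class name, the add-alpha smoothing and the event labels in the usage note are all hypothetical.

```python
import math
from collections import defaultdict


class MarkovSurprise:
    """First-order Markov model over event labels with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()

    def fit(self, sessions):
        """Count transitions between consecutive events in normal sessions."""
        for session in sessions:
            self.vocab.update(session)
            for a, b in zip(session, session[1:]):
                self.counts[a][b] += 1
        return self

    def surprise(self, session):
        """Average -log p per transition; higher means less expected."""
        slots = len(self.vocab) + 1  # one extra slot for unseen events
        total = 0.0
        for a, b in zip(session, session[1:]):
            row = self.counts.get(a, {})
            seen = sum(row.values())
            p = (row.get(b, 0) + self.alpha) / (seen + self.alpha * slots)
            total -= math.log(p)
        return total / max(1, len(session) - 1)
```

Trained on sessions such as `["home", "login_page", "image", "submit", "account"]`, the model gives a low score to that ordering and a high score to a session where the same events arrive in an order it has rarely seen. The resulting scores can then be thresholded with an adaptive quantile, as with the other detectors.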
  • 50. Recap (out of order) •  Anomaly detection is best done with a probability model •  Deep learning is a neat way to build this model –  converting to symbolic dynamics simplifies life •  -log p is a good way to convert to an anomaly measure •  Adaptive quantile estimation works for auto-setting thresholds
  • 51. Recap •  Different systems require different models •  Continuous time-series –  sparse coding or deep learning to build signal model •  Events in time –  rate model based on variable-rate Poisson –  segregated rate model •  Events with labels –  language modeling –  hidden Markov models
  • 52. How Do I Build Such a System? •  The key is to combine real-time and long-time –  real-time evaluates the data stream against the model –  long-time is how we build the model •  Extended Lambda architecture is my favorite •  See my other talks on slideshare.net for info •  Ping me directly
  • 53. Hadoop is Not Very Real-time [Timeline diagram, axis t, labels: Fully processed; Latest full period; Unprocessed Data; now; Hadoop job takes this long for this data]
  • 54. Real-time and Long-time together [Timeline diagram, axis t, labels: Blended view; now; Hadoop works great back here; Storm works here]
  • 55. Who I am •  Ted Dunning, Chief Application Architect, MapR tdunning@mapr.com tdunning@apache.org @ted_dunning •  Committer, mentor, champion, PMC member on several Apache projects •  Mahout, Drill, Zookeeper, others