SlideShare ist ein Scribd-Unternehmen logo
1 von 63
© 2014 MapR Technologies 1
© MapR Technologies, confidential
Anomaly Detection, Time-Series Data Bases
and More
How to Find What You Didn’t
Know to Look For
© 2014 MapR Technologies 2
Agenda
• What is anomaly detection?
• Some examples
• Some generalization
• Compression == Truth
• Deep dive into deep learning
• Why this matters for time series databases
This goes beyond what was described in the abstract…
© 2014 MapR Technologies 3
Who I am
• Ted Dunning, Chief Application Architect, MapR
tdunning@mapr.com
tdunning@apache.org
Twitter @ted_dunning
• Committer, mentor, champion, PMC member on several Apache
projects
• Mahout, Drill, Zookeeper, Spark and others
• Hashtag for today’s talk #HadoopSummit
© 2014 MapR Technologies 4
Who we are
• MapR makes the technology leading distribution including
Hadoop
• MapR integrates real-time data semantics directly into a system
that also runs Hadoop programs seamlessly
• The biggest and best choose MapR
– Google, Amazon
– Largest credit card, retailer, health insurance, telco
– Ping me for info
© 2014 MapR Technologies 5
A New Look at Anomaly Detection
• O’Reilly series Practical Machine Learning 2nd volume
• Print copies available at Hadoop Summit at MapR booth
• eBook available at http://bit.ly/1jQ9QuL
• Book signing by both authors at
4pm Wed (today)
© 2014 MapR Technologies 6
What is Anomaly Detection?
• What just happened that shouldn’t?
– but I don’t know what failure looks like (yet)
• Find the problem before other people see it
– especially customers and CEO’s
• But don’t wake me up if it isn’t really broken
© 2014 MapR Technologies 7
© 2014 MapR Technologies 8
Looks pretty
anomalous
to me
© 2014 MapR Technologies 9
Will the real anomaly
please stand up?
© 2014 MapR Technologies 10
What Are We Really Doing
• We want action when something breaks
(dies/falls over/otherwise gets in trouble)
• But action is expensive
• So we don’t want false alarms
• And we don’t want false negatives
• We need to trade off costs
© 2014 MapR Technologies 11
A Second Look
© 2014 MapR Technologies 12
A Second Look
99.9%-ile
© 2014 MapR Technologies 13
Online
Summarizer
99.9%-ile
t
x > t ? Alarm !
x
How Hard Can it Be?
© 2014 MapR Technologies 14
On-line Percentile Estimates
• Apache Mahout has on-line percentile estimator
– very high accuracy for extreme tails
– new in version 0.9 !!
• What’s the big deal with anomaly detection?
• This looks like a solved problem
© 2014 MapR Technologies 15
Already Done? Etsy Skyline?
© 2014 MapR Technologies 16
What About This?
0 5 10 15
−20246810
offset+noise+pulse1+pulse2
A
B
© 2014 MapR Technologies 17
Spot the Anomaly
Anomaly?
© 2014 MapR Technologies 18
Maybe not!
© 2014 MapR Technologies 19
Where’s Waldo?
This is the real
anomaly
© 2014 MapR Technologies 20
Normal Isn’t Just Normal
• What we want is a model of what is normal
• What doesn’t fit the model is the anomaly
• For simple signals, the model can be simple …
• The real world is rarely so accommodating
x ~ N(0,e)
© 2014 MapR Technologies 21
We Do Windows
© 2014 MapR Technologies 22
We Do Windows
© 2014 MapR Technologies 23
We Do Windows
© 2014 MapR Technologies 24
We Do Windows
© 2014 MapR Technologies 25
We Do Windows
© 2014 MapR Technologies 26
We Do Windows
© 2014 MapR Technologies 27
We Do Windows
© 2014 MapR Technologies 28
We Do Windows
© 2014 MapR Technologies 29
We Do Windows
© 2014 MapR Technologies 30
We Do Windows
© 2014 MapR Technologies 31
We Do Windows
© 2014 MapR Technologies 32
We Do Windows
© 2014 MapR Technologies 33
We Do Windows
© 2014 MapR Technologies 34
We Do Windows
© 2014 MapR Technologies 35
We Do Windows
© 2014 MapR Technologies 36
Windows on the World
• The set of windowed signals is a nice model of our original signal
• Clustering can find the prototypes
– Fancier techniques available using sparse coding
• The result is a dictionary of shapes
• New signals can be encoded by shifting, scaling and adding
shapes from the dictionary
© 2014 MapR Technologies 37
Most Common Shapes (for EKG)
© 2014 MapR Technologies 38
Reconstructed signal
Original
signal
Reconstructed
signal
Reconstruction
error
< 1 bit / sample
© 2014 MapR Technologies 39
An Anomaly
Original technique for finding
1-d anomaly works against
reconstruction error
© 2014 MapR Technologies 40
Close-up of anomaly
Not what you want your
heart to do.
And not what the model
expects it to do.
© 2014 MapR Technologies 41
A Different Kind of Anomaly
© 2014 MapR Technologies 42
Model Delta Anomaly Detection
Online
Summarizer
δ > t ?
99.9%-ile
t
Alarm !
Model
-
+ δ
© 2014 MapR Technologies 43
The Real Inside Scoop
• The model-delta anomaly detector is really just a sum of random
variables
– the model we know about already
– and a normally distributed error
• The output (delta) is (roughly) the log probability of the sum
distribution (really δ2)
• Thinking about probability distributions is good
© 2014 MapR Technologies 44
Example: Event Stream (timing)
• Events of various types arrive at irregular intervals
– we can assume Poisson distribution
• The key question is whether frequency has changed relative to
expected values
• Want alert as soon as possible
© 2014 MapR Technologies 45
Poisson Distribution
• Time between events is exponentially distributed
• This means that long delays are exponentially rare
• If we know λ we can select a good threshold
– or we can pick a threshold empirically
Dt ~ le-lt
P(Dt > T) = e-lT
-logP(Dt > T) = lT
© 2014 MapR Technologies 46
Converting Event Times to Anomaly
99.9%-ile
99.99%-ile
© 2014 MapR Technologies 51
Recap (out of order)
• Anomaly detection is best done with a probability model
• -log p is a good way to convert to anomaly measure
• Adaptive quantile estimation works for auto-setting thresholds
© 2014 MapR Technologies 52
Recap
• Different systems require different models
• Continuous time-series
– sparse coding to build signal model
• Events in time
– rate model base on variable rate Poisson
– segregated rate model
• Events with labels
– language modeling
– hidden Markov models
© 2014 MapR Technologies 53
But Wait! Compression is Truth
• Maximizing log πk is minimizing compressed size
– (each symbol takes -log πk bits on average)
• Maximizing log πk happens where πk = pk
– (maximum likelihood principle)
© 2014 MapR Technologies 54
But Auto-encoders Find Max Likelihood
• Minimal error => maximum likelihood
• Maximum likelihood => maximum compression
• So good anomaly detectors give good compression
© 2014 MapR Technologies 55
In Case You Want the Details
E pk logpk[ ] = pk logpk
k
å
log x £ x -1
pk log
pk
pkk
å £ pk 1-
pk
pk
æ
è
ç
ö
ø
÷ = pk
k
å - pk
k
å
k
å = 0
pk logpk - pk log pk
k
å £ 0
k
å
pk logpk £ pk log pk
k
å
k
å
E pk logpk[ ] »
1
n
logpxi
i
å
© 2014 MapR Technologies 56
Pause To Reflect on Clustering
• Use windowing to apportion signal
– Hamming windows add up to 1
• Find nearest cluster for each window
– Can use dot product because all clusters normalized
• Scale cluster to right size
– Dot product again
• Subtract from original signal
© 2014 MapR Technologies 57
Auto-encoding - Information Bottleneck
Original Signal
Encoder
Reconstructed Signal
© 2014 MapR Technologies 58
x1 x2
...
xn-1 xn
x'1 x'2
...
x'n-1 x'n
Input
Hidden layer
(clusters)
Reconstruction
Clustering as Neural Network
Hidden layer is 1
of k
Could be m
of kSparsity allows k >>
100
© 2014 MapR Technologies 59
Overlapping Networks
Time series input
Reconstructed time series
...
...
...
...
...
...
...
...
...
...
© 2014 MapR Technologies 60
Deep Learning
...
...
...
...
... ... ... ... ...
... ... ... ... ...
A
B
A'
© 2014 MapR Technologies 61
What About the Database?
• We don’t have to keep the reconstruction
• We can keep the first level nodes
– And the reconstruction error
• To keep the first level nodes
– We can keep the second level nodes
– Plus the reconstruction error
© 2014 MapR Technologies 62
What Does it Matter?
• Even one level of auto-encoding compresses
– 30-50x in EKG example with k-means
• Multiple levels compress more
– Understanding => Truth => Compression
• Higher levels give semantic search
© 2014 MapR Technologies 63
How Do I Build Such a System
• The key is to combine real-time and long-time
– real-time evaluates data stream against model
– long-time is how we build the model
• Extended Lambda architecture is my favorite
• See my other talks on slideshare.net for info
• Ping me directly
© 2014 MapR Technologies 64
t
now
Hadoop is Not Very Real-time
UnprocessedData
Fully processed Latest full
period
Hadoop job
takes this long
for this data
© 2014 MapR Technologies 65
t
now
Hadoop works great
back here
Storm
works
here
Real-time and Long-time together
Blended viewBlended viewBlended View
© 2014 MapR Technologies 66
Who I am
• Ted Dunning, Chief Application Architect, MapR
tdunning@mapr.com
tdunning@apache.org
@ted_dunning
• Committer, mentor, champion, PMC member on several Apache
projects
• Mahout, Drill, Zookeeper others
© 2014 MapR Technologies 67
© MapR Technologies, confidential

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (10)

Devday 2017 Hands On Presentation
Devday 2017 Hands On PresentationDevday 2017 Hands On Presentation
Devday 2017 Hands On Presentation
 
London hug
London hugLondon hug
London hug
 
An Efficient DSP Based Implementation of a Fast Convolution Approach with non...
An Efficient DSP Based Implementation of a Fast Convolution Approach with non...An Efficient DSP Based Implementation of a Fast Convolution Approach with non...
An Efficient DSP Based Implementation of a Fast Convolution Approach with non...
 
Storm 2012-03-29
Storm 2012-03-29Storm 2012-03-29
Storm 2012-03-29
 
A Transcat.com Webinar Presented by Aglient Technolgoes: Scope Technology Imp...
A Transcat.com Webinar Presented by Aglient Technolgoes: Scope Technology Imp...A Transcat.com Webinar Presented by Aglient Technolgoes: Scope Technology Imp...
A Transcat.com Webinar Presented by Aglient Technolgoes: Scope Technology Imp...
 
New Media Services from a Mobile Chipset Vendor and Standardization Perspective
New Media Services from a Mobile Chipset Vendor and Standardization PerspectiveNew Media Services from a Mobile Chipset Vendor and Standardization Perspective
New Media Services from a Mobile Chipset Vendor and Standardization Perspective
 
Using neon for pattern recognition in audio data
Using neon for pattern recognition in audio dataUsing neon for pattern recognition in audio data
Using neon for pattern recognition in audio data
 
Realistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentRealistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure Development
 
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co..."New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
 
A Platform for Accelerating Machine Learning Applications
 A Platform for Accelerating Machine Learning Applications A Platform for Accelerating Machine Learning Applications
A Platform for Accelerating Machine Learning Applications
 

Ähnlich wie How to find what you didn't know to look for, oractical anomaly detection

How the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownHow the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside Down
DataWorks Summit
 
Hadoop and R Go to the Movies
Hadoop and R Go to the MoviesHadoop and R Go to the Movies
Hadoop and R Go to the Movies
DataWorks Summit
 
How to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterHow to Determine which Algorithms Really Matter
How to Determine which Algorithms Really Matter
DataWorks Summit
 
How to tell which algorithms really matter
How to tell which algorithms really matterHow to tell which algorithms really matter
How to tell which algorithms really matter
DataWorks Summit
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with Hadoop
DataWorks Summit
 
Practical Computing Wiith Chaos
Practical Computing Wiith ChaosPractical Computing Wiith Chaos
Practical Computing Wiith Chaos
MapR Technologies
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015
Ted Dunning
 

Ähnlich wie How to find what you didn't know to look for, oractical anomaly detection (20)

Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine Learning
 
Mathematical bridges From Old to New
Mathematical bridges From Old to NewMathematical bridges From Old to New
Mathematical bridges From Old to New
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
 
Strata 2014-tdunning-anomaly-detection-140211162923-phpapp01
Strata 2014-tdunning-anomaly-detection-140211162923-phpapp01Strata 2014-tdunning-anomaly-detection-140211162923-phpapp01
Strata 2014-tdunning-anomaly-detection-140211162923-phpapp01
 
Practical Computing with Chaos
Practical Computing with ChaosPractical Computing with Chaos
Practical Computing with Chaos
 
Practical Computing With Chaos
Practical Computing With ChaosPractical Computing With Chaos
Practical Computing With Chaos
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossible
 
Dealing with an Upside Down Internet
Dealing with an Upside Down InternetDealing with an Upside Down Internet
Dealing with an Upside Down Internet
 
How the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownHow the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside Down
 
Hadoop and R Go to the Movies
Hadoop and R Go to the MoviesHadoop and R Go to the Movies
Hadoop and R Go to the Movies
 
How to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterHow to Determine which Algorithms Really Matter
How to Determine which Algorithms Really Matter
 
How to tell which algorithms really matter
How to tell which algorithms really matterHow to tell which algorithms really matter
How to tell which algorithms really matter
 
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with Hadoop
 
Practical Computing Wiith Chaos
Practical Computing Wiith ChaosPractical Computing Wiith Chaos
Practical Computing Wiith Chaos
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015
 
Dealing with an Upside Down Internet With High Performance Time Series Database
Dealing with an Upside Down Internet  With High Performance Time Series DatabaseDealing with an Upside Down Internet  With High Performance Time Series Database
Dealing with an Upside Down Internet With High Performance Time Series Database
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015
 
Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15
Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15
Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15
 

Mehr von DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

Mehr von DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Kürzlich hochgeladen

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Kürzlich hochgeladen (20)

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 

How to find what you didn't know to look for, oractical anomaly detection

  • 1. © 2014 MapR Technologies 1 © MapR Technologies, confidential Anomaly Detection, Time-Series Data Bases and More How to Find What You Didn’t Know to Look For
  • 2. © 2014 MapR Technologies 2 Agenda • What is anomaly detection? • Some examples • Some generalization • Compression == Truth • Deep dive into deep learning • Why this matters for time series databases This goes beyond what was described in the abstract…
  • 3. © 2014 MapR Technologies 3 Who I am • Ted Dunning, Chief Application Architect, MapR tdunning@mapr.com tdunning@apache.org Twitter @ted_dunning • Committer, mentor, champion, PMC member on several Apache projects • Mahout, Drill, Zookeeper, Spark and others • Hashtag for today’s talk #HadoopSummit
  • 4. © 2014 MapR Technologies 4 Who we are • MapR makes the technology leading distribution including Hadoop • MapR integrates real-time data semantics directly into a system that also runs Hadoop programs seamlessly • The biggest and best choose MapR – Google, Amazon – Largest credit card, retailer, health insurance, telco – Ping me for info
  • 5. © 2014 MapR Technologies 5 A New Look at Anomaly Detection • O’Reilly series Practical Machine Learning 2nd volume • Print copies available at Hadoop Summit at MapR booth • eBook available at http://bit.ly/1jQ9QuL • Book signing by both authors at 4pm Wed (today)
  • 6. © 2014 MapR Technologies 6 What is Anomaly Detection? • What just happened that shouldn’t? – but I don’t know what failure looks like (yet) • Find the problem before other people see it – especially customers and CEO’s • But don’t wake me up if it isn’t really broken
  • 7. © 2014 MapR Technologies 7
  • 8. © 2014 MapR Technologies 8 Looks pretty anomalous to me
  • 9. © 2014 MapR Technologies 9 Will the real anomaly please stand up?
  • 10. © 2014 MapR Technologies 10 What Are We Really Doing • We want action when something breaks (dies/falls over/otherwise gets in trouble) • But action is expensive • So we don’t want false alarms • And we don’t want false negatives • We need to trade off costs
  • 11. © 2014 MapR Technologies 11 A Second Look
  • 12. © 2014 MapR Technologies 12 A Second Look 99.9%-ile
  • 13. © 2014 MapR Technologies 13 Online Summarizer 99.9%-ile t x > t ? Alarm ! x How Hard Can it Be?
  • 14. © 2014 MapR Technologies 14 On-line Percentile Estimates • Apache Mahout has on-line percentile estimator – very high accuracy for extreme tails – new in version 0.9 !! • What’s the big deal with anomaly detection? • This looks like a solved problem
  • 15. © 2014 MapR Technologies 15 Already Done? Etsy Skyline?
  • 16. © 2014 MapR Technologies 16 What About This? 0 5 10 15 −20246810 offset+noise+pulse1+pulse2 A B
  • 17. © 2014 MapR Technologies 17 Spot the Anomaly Anomaly?
  • 18. © 2014 MapR Technologies 18 Maybe not!
  • 19. © 2014 MapR Technologies 19 Where’s Waldo? This is the real anomaly
  • 20. © 2014 MapR Technologies 20 Normal Isn’t Just Normal • What we want is a model of what is normal • What doesn’t fit the model is the anomaly • For simple signals, the model can be simple … • The real world is rarely so accommodating x ~ N(0,e)
  • 21. © 2014 MapR Technologies 21 We Do Windows
  • 22. © 2014 MapR Technologies 22 We Do Windows
  • 23. © 2014 MapR Technologies 23 We Do Windows
  • 24. © 2014 MapR Technologies 24 We Do Windows
  • 25. © 2014 MapR Technologies 25 We Do Windows
  • 26. © 2014 MapR Technologies 26 We Do Windows
  • 27. © 2014 MapR Technologies 27 We Do Windows
  • 28. © 2014 MapR Technologies 28 We Do Windows
  • 29. © 2014 MapR Technologies 29 We Do Windows
  • 30. © 2014 MapR Technologies 30 We Do Windows
  • 31. © 2014 MapR Technologies 31 We Do Windows
  • 32. © 2014 MapR Technologies 32 We Do Windows
  • 33. © 2014 MapR Technologies 33 We Do Windows
  • 34. © 2014 MapR Technologies 34 We Do Windows
  • 35. © 2014 MapR Technologies 35 We Do Windows
  • 36. © 2014 MapR Technologies 36 Windows on the World • The set of windowed signals is a nice model of our original signal • Clustering can find the prototypes – Fancier techniques available using sparse coding • The result is a dictionary of shapes • New signals can be encoded by shifting, scaling and adding shapes from the dictionary
  • 37. © 2014 MapR Technologies 37 Most Common Shapes (for EKG)
  • 38. © 2014 MapR Technologies 38 Reconstructed signal Original signal Reconstructed signal Reconstruction error < 1 bit / sample
  • 39. © 2014 MapR Technologies 39 An Anomaly Original technique for finding 1-d anomaly works against reconstruction error
  • 40. © 2014 MapR Technologies 40 Close-up of anomaly Not what you want your heart to do. And not what the model expects it to do.
  • 41. © 2014 MapR Technologies 41 A Different Kind of Anomaly
  • 42. © 2014 MapR Technologies 42 Model Delta Anomaly Detection Online Summarizer δ > t ? 99.9%-ile t Alarm ! Model - + δ
  • 43. © 2014 MapR Technologies 43 The Real Inside Scoop • The model-delta anomaly detector is really just a sum of random variables – the model we know about already – and a normally distributed error • The output (delta) is (roughly) the log probability of the sum distribution (really δ2) • Thinking about probability distributions is good
  • 44. © 2014 MapR Technologies 44 Example: Event Stream (timing) • Events of various types arrive at irregular intervals – we can assume Poisson distribution • The key question is whether frequency has changed relative to expected values • Want alert as soon as possible
  • 45. © 2014 MapR Technologies 45 Poisson Distribution • Time between events is exponentially distributed • This means that long delays are exponentially rare • If we know λ we can select a good threshold – or we can pick a threshold empirically Dt ~ le-lt P(Dt > T) = e-lT -logP(Dt > T) = lT
  • 46. © 2014 MapR Technologies 46 Converting Event Times to Anomaly 99.9%-ile 99.99%-ile
  • 47. © 2014 MapR Technologies 51 Recap (out of order) • Anomaly detection is best done with a probability model • -log p is a good way to convert to anomaly measure • Adaptive quantile estimation works for auto-setting thresholds
  • 48. © 2014 MapR Technologies 52 Recap • Different systems require different models • Continuous time-series – sparse coding to build signal model • Events in time – rate model base on variable rate Poisson – segregated rate model • Events with labels – language modeling – hidden Markov models
  • 49. © 2014 MapR Technologies 53 But Wait! Compression is Truth • Maximizing log πk is minimizing compressed size – (each symbol takes -log πk bits on average) • Maximizing log πk happens where πk = pk – (maximum likelihood principle)
  • 50. © 2014 MapR Technologies 54 But Auto-encoders Find Max Likelihood • Minimal error => maximum likelihood • Maximum likelihood => maximum compression • So good anomaly detectors give good compression
  • 51. © 2014 MapR Technologies 55 In Case You Want the Details E pk logpk[ ] = pk logpk k å log x £ x -1 pk log pk pkk å £ pk 1- pk pk æ è ç ö ø ÷ = pk k å - pk k å k å = 0 pk logpk - pk log pk k å £ 0 k å pk logpk £ pk log pk k å k å E pk logpk[ ] » 1 n logpxi i å
  • 52. © 2014 MapR Technologies 56 Pause To Reflect on Clustering • Use windowing to apportion signal – Hamming windows add up to 1 • Find nearest cluster for each window – Can use dot product because all clusters normalized • Scale cluster to right size – Dot product again • Subtract from original signal
  • 53. © 2014 MapR Technologies 57 Auto-encoding - Information Bottleneck Original Signal Encoder Reconstructed Signal
  • 54. © 2014 MapR Technologies 58 x1 x2 ... xn-1 xn x'1 x'2 ... x'n-1 x'n Input Hidden layer (clusters) Reconstruction Clustering as Neural Network Hidden layer is 1 of k Could be m of kSparsity allows k >> 100
  • 55. © 2014 MapR Technologies 59 Overlapping Networks Time series input Reconstructed time series ... ... ... ... ... ... ... ... ... ...
  • 56. © 2014 MapR Technologies 60 Deep Learning ... ... ... ... ... ... ... ... ... ... ... ... ... ... A B A'
  • 57. © 2014 MapR Technologies 61 What About the Database? • We don’t have to keep the reconstruction • We can keep the first level nodes – And the reconstruction error • To keep the first level nodes – We can keep the second level nodes – Plus the reconstruction error
  • 58. © 2014 MapR Technologies 62 What Does it Matter? • Even one level of auto-encoding compresses – 30-50x in EKG example with k-means • Multiple levels compress more – Understanding => Truth => Compression • Higher levels give semantic search
  • 59. © 2014 MapR Technologies 63 How Do I Build Such a System • The key is to combine real-time and long-time – real-time evaluates data stream against model – long-time is how we build the model • Extended Lambda architecture is my favorite • See my other talks on slideshare.net for info • Ping me directly
  • 60. © 2014 MapR Technologies 64 t now Hadoop is Not Very Real-time UnprocessedData Fully processed Latest full period Hadoop job takes this long for this data
  • 61. © 2014 MapR Technologies 65 t now Hadoop works great back here Storm works here Real-time and Long-time together Blended viewBlended viewBlended View
  • 62. © 2014 MapR Technologies 66 Who I am • Ted Dunning, Chief Application Architect, MapR tdunning@mapr.com tdunning@apache.org @ted_dunning • Committer, mentor, champion, PMC member on several Apache projects • Mahout, Drill, Zookeeper others
  • 63. © 2014 MapR Technologies 67 © MapR Technologies, confidential

Hinweis der Redaktion

  1. USE THIS
  2. USE THIS