Anomaly Detection @ Twitter
  
Vijay Rajaram, Jenna Zhang, Arun Kejariwal 
(@djvjallday, @jenna_zz, @arun_kejariwal)


February 2015
Internet Trends: Real-time Communication
  
Data Fidelity

•  Data-driven decision making
   ❑ Evolving product landscape
•  Data partners
   ❑ Nielsen
   ❑ Dataminr
•  Operational
   ❑ Performance and availability

[Figure label: A/B Testing]
  
Data Fidelity: Challenges

•  Anomalies
   ❑ Exogenic factors
      ▪ User behavior
      ▪ Events
      ▪ Data center
   ❑ Endogenic factors
      ▪ Agile development
         o Fail fast
      ▪ Data collection
•  Millions of time series [1,2]
   ❑ Scalability

[1] http://strata.oreilly.com/2013/09/how-twitter-monitors-millions-of-time-series.html
[2] http://strataconf.com/strata2014/public/schedule/detail/32431
  
Anomaly Detection

•  Visual
   ❑ Prone to errors
   ❑ Not scalable
      ▪ Machine-generated data: from 11% of the digital universe in 2005 to > 40% by 2020 [1]
      ▪ Cloud infrastructure 2013-2017 CAGR ~50% [2]
•  Algorithmic approach
   ❑ Automate!

[1] http://www.emc.com/about/news/press/2012/20121211-01.htm
[2] http://www.forbes.com/sites/gilpress/2013/12/12/16-1-billion-big-data-market-2014-predictions-from-idc-and-iia/
  	
  
Anomaly Detection: Background

•  Over 50 years of research [1]
   ❑ Statistics
      ▪ Extreme Value Theory
      ▪ Robust statistics, Grubbs' test, ESD
   ❑ Econometrics
   ❑ Finance
      ▪ Value at Risk (VaR)
   ❑ Signal processing
   ❑ Music information retrieval
   ❑ Networking
   ❑ E-commerce
   ❑ Performance regression

[1] “Anomaly Detection” by Chandola et al., ACM Computing Surveys, 2009.
  	
  
Anomaly Detection

•  Characterization
   ❑ Magnitude
   ❑ Width
   ❑ Frequency
   ❑ Direction
  
Anomaly Detection (contd.)

•  Two flavors
   ❑ Global
      ▪ Max value
   ❑ Local
      ▪ Intra-day

[Figure labels: Global, Local]
Anomaly Detection (contd.)

•  Traditional approaches
   ❑ Metrics
      ▪ Mean μ
      ▪ Standard deviation σ
   ❑ Rule of thumb
      ▪ μ + 3σ
   ❑ Which time series?
      ▪ Raw
      ▪ Moving averages
         o SMA, EWMA, PEWMA

[Figure label: 3σ]
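The rule of thumb above can be sketched in a few lines. This is a minimal illustration, not the implementation used at Twitter: the function names, the smoothing factor `alpha`, and the sample series are all assumptions, and the check is two-sided (|x - μ| > 3σ) rather than the one-sided μ + 3σ shown on the slide.

```python
import statistics


def three_sigma_anomalies(values, k=3.0):
    """Return indices whose value lies more than k standard deviations from the mean."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > k * sigma]


def ewma(values, alpha=0.3):
    """Exponentially weighted moving average (alpha is a hypothetical smoothing factor)."""
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical flat series with a single spike at index 20.
series = [10.0] * 30
series[20] = 100.0

raw_hits = three_sigma_anomalies(series)             # rule applied to the raw series
smoothed_hits = three_sigma_anomalies(ewma(series))  # rule applied to the EWMA
```

Applying the rule to a moving average rather than the raw series trades sensitivity for smoothness, which is why the choice of input series matters.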
Anomaly Detection (contd.)

•  Impact of a multi-modal distribution
   ❑ Shifts μ by ~0.2%
   ❑ Inflates σ by 4.5%
      ▪ Misses quite a few anomalies
   ❑ What do the multiple modes correspond to?
      ▪ Seasonality
  
Anomaly Detection (contd.)

•  Robust statistics
   ❑ MAD (Median Absolute Deviation)
      ▪ Robust breakdown point
         o Median 50% vs. mean 0%
   ❑ σ_MAD = K · MAD
      ▪ K = 1.4826 for normally distributed data
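A small sketch of the MAD-based scale estimate, using only the Python standard library. The helper names and the sample data are illustrative assumptions; the constant K = 1.4826 is the one quoted on the slide.

```python
import statistics

K = 1.4826  # consistency constant for normally distributed data (from the slide)


def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def sigma_mad(values):
    """Robust estimate of the standard deviation, K * MAD."""
    return K * mad(values)


# Hypothetical data: the same sample with and without one extreme value.
clean = [9.0, 10.0, 11.0, 10.0, 9.0, 11.0, 10.0]
dirty = clean[:-1] + [1000.0]

clean_scale = sigma_mad(clean)
dirty_scale = sigma_mad(dirty)          # barely affected by the extreme value
dirty_stdev = statistics.stdev(dirty)   # heavily inflated by the extreme value
```

The single extreme value leaves the MAD-based estimate unchanged while the classical standard deviation grows by orders of magnitude, which is the breakdown-point argument in practice.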
  
Anomaly Detection (contd.)

•  Grubbs' test
   ❑ Critical value is computed for the data at a chosen significance level (α)
•  ESD (Generalized Extreme Studentized Deviate) [1]
   ❑ Critical value (λ_i) is recalculated at every iteration
   ❑ The largest i such that R_i > λ_i determines the number of anomalies
   ❑ An upper bound on the number of anomalies is an input parameter

[1] Rosner, Bernard. “Percentage Points for a Generalized ESD Many-outlier Procedure.” Technometrics 25, no. 2 (1983): 165–172.
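For reference, the quantities R_i and λ_i can be written out as below. This is the standard textbook formulation of the generalized ESD test, stated here as an aid and not taken from the slide. The symbols are introduced for illustration: n is the number of observations, k the upper bound on the number of anomalies, x̄ and s the mean and standard deviation of the observations still under consideration at step i, and t_{p,ν} the 100p percentage point of the t-distribution with ν degrees of freedom.

```latex
\[
R_i = \frac{\max_j \lvert x_j - \bar{x} \rvert}{s}, \qquad i = 1, \dots, k
\]
\[
\lambda_i = \frac{(n - i)\, t_{p,\,n-i-1}}
                 {\sqrt{\left(n - i - 1 + t_{p,\,n-i-1}^{2}\right)\left(n - i + 1\right)}},
\qquad p = 1 - \frac{\alpha}{2\,(n - i + 1)}
\]
```

After each step the observation that maximizes |x_j - x̄| is removed, and x̄ and s are recomputed on the remaining points before the next R_i is evaluated.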
  
Our Approach

Anomaly Detection (contd.)

•  Addressing seasonality
   ❑ Key idea
      ▪ Time series decomposition
  
Anomaly Detection (contd.)

•  Impact of removing the seasonal and trend components
   ❑ Transforms our multi-modal data into unimodal data
      ▪ Amenable to ESD/MAD!

The decomposed residual becomes unimodal, which significantly shrinks sigma. The original multi-modal raw data has a much larger sigma, leading ESD to miss many of the outliers.
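The effect on sigma can be illustrated with a toy example. This is a sketch under stated assumptions: the seasonal component is estimated naively as the per-phase median (the slides do not specify the decomposition method at this point, and no trend is modeled here), and the series, the period, and the function name are hypothetical.

```python
import statistics


def remove_seasonality(values, period):
    """Subtract a naive seasonal estimate (the per-phase median) from each point."""
    seasonal = [statistics.median(values[phase::period]) for phase in range(period)]
    return [v - seasonal[i % period] for i, v in enumerate(values)]


# Hypothetical bimodal series: "day" level 100, "night" level 10, period 2.
series = [100.0, 10.0] * 12
series[5] = 40.0  # a night-time spike that is still below the day-time level

residual = remove_seasonality(series, period=2)

raw_sigma = statistics.pstdev(series)
residual_sigma = statistics.pstdev(residual)

# Three-sigma check on the raw series versus the residual.
missed_on_raw = abs(series[5] - statistics.fmean(series)) <= 3 * raw_sigma
caught_on_residual = abs(residual[5] - statistics.fmean(residual)) > 3 * residual_sigma
```

On the raw series the two modes dominate sigma, so the spike sits comfortably inside three sigma; on the residual the same point stands out.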
Anomaly Detection (contd.)

•  Challenges remain!

Trend smoothing distortion creates “phantom” anomalies.
  
Anomaly Detection (contd.)

•  Marrying robust statistics with seasonal decomposition

The median is free from distortion.
Anomaly Detection (contd.)

•  Applying ESD on the residual

Decomposition exposes anomalies.
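A simplified sketch of the idea: compute a residual by subtracting a seasonal estimate and the median (instead of a smoothed trend), then score the residual with median and MAD. Several pieces are assumptions made for illustration: the seasonal term is a naive per-phase median rather than a proper decomposition, a fixed threshold of 3 stands in for the ESD critical values, and the series and function names are hypothetical.

```python
import statistics

K = 1.4826  # consistency constant for normally distributed data


def robust_residual(values, period):
    """x - seasonal - median(x), with a naive per-phase-median seasonal estimate."""
    center = statistics.median(values)
    centred = [v - center for v in values]
    seasonal = [statistics.median(centred[p::period]) for p in range(period)]
    return [c - seasonal[i % period] for i, c in enumerate(centred)]


def robust_scores(residual):
    """Distance from the median in units of K * MAD."""
    med = statistics.median(residual)
    scale = K * statistics.median([abs(r - med) for r in residual])
    if scale == 0:
        raise ValueError("MAD is zero; scores are undefined for this input")
    return [abs(r - med) / scale for r in residual]


# Hypothetical seasonal series (period 4) with small deterministic jitter.
base = [10.0, 50.0, 100.0, 50.0]
jitter = [0.0, 1.0, -1.0, 2.0, -2.0, 1.0, 0.0, -1.0]
series = [base[i % 4] + jitter[i % 8] for i in range(24)]
series[13] = 90.0  # injected anomaly, still below the seasonal peak

scores = robust_scores(robust_residual(series, period=4))
flagged = [i for i, s in enumerate(scores) if s > 3.0]
```

The injected point is lower than the regular seasonal peak, so a global threshold on the raw values would not isolate it, while the residual score does.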
Anomaly Detection (contd.)

•  Illustrative example
  
Open Source

•  Standalone R package
   ❑ https://github.com/twitter/AnomalyDetection
   ❑ Key features
      ▪ Filter
         o Last day, last hour
         o Direction: positive, negative, both
      ▪ Expected values
      ▪ Long term
         o Piecewise approximation (HotCloud’14 research paper)
   ❑ Widely used
•  Blog
   ❑ https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series
  
Anomaly Detection (contd.)

•  Pluggable design
   ❑ Data source
      ▪ Currently supports different data sources
   ❑ Detector
•  Usage
   ❑ Library
      ▪ Mesos job
   ❑ Service
      ▪ RESTful API

Status: used by 10+ internal customers
  
Anomaly Detection (contd.)

•  E-mail notification
•  JIRA integration
   ❑ Ticket auto-created if an anomaly is detected
Anomaly Detection (contd.)

•  Granularities
   ❑ Daily
      ▪ Seasonal adjustment based on day of the week
         o Keep things simple
   ❑ Minutely
      ▪ S-H-ESD (Seasonal Hybrid ESD)
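For the daily granularity, a day-of-week adjustment can be as simple as subtracting a per-weekday baseline. The sketch below is an assumption about what such an adjustment might look like, not the production code: it uses the per-weekday median as the baseline (the slide does not say which statistic is used), and the data, dates, and function name are hypothetical.

```python
import statistics
from datetime import date, timedelta


def day_of_week_adjust(daily):
    """daily: list of (date, value) pairs. Returns (date, value - weekday median) pairs."""
    by_weekday = {}
    for d, v in daily:
        by_weekday.setdefault(d.weekday(), []).append(v)
    baseline = {wd: statistics.median(vs) for wd, vs in by_weekday.items()}
    return [(d, v - baseline[d.weekday()]) for d, v in daily]


# Hypothetical four weeks of data: weekdays at 100, weekends at 60.
start = date(2015, 2, 2)
daily = []
for i in range(28):
    d = start + timedelta(days=i)
    daily.append((d, 60.0 if d.weekday() >= 5 else 100.0))

# Inject one unusual day.
daily[16] = (daily[16][0], 150.0)

adjusted = day_of_week_adjust(daily)
unusual_days = [d for d, r in adjusted if r != 0]
```

After the adjustment the regular weekday/weekend difference disappears and only the injected day carries a non-zero residual, which can then be fed to a detector.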
  
Real-time Anomaly Detection

•  Lessons learned in the wild
   ❑ Summingbird [1]: Lambda architecture
   ❑ Real time: data integrity issues (lag between real time and batch)
      ▪ Periodic update to cache
      ▪ Higher threshold

[1] O. Boykin, S. Ritchie, I. O'Connell, and J. Lin. "Summingbird: a framework for integrating batch and online MapReduce computations." Proceedings of the VLDB Endowment, 7:13, pp. 1441-1451, August 2014.
  
Real-time Anomaly Detection (contd.)

•  Lessons learned in the wild
   ❑ JVM-R bridges
      ▪ High latency
      ▪ Exception handling missing
   ❑ Looping future model
      ▪ Finagle
   ❑ Few historical anomalies
  
Real-time Anomaly Detection (contd.)

•  Future work
   ❑ Streaming algorithms
      ▪ Key for sub-minute data granularity
   ❑ Making the job more robust
      ▪ Minimizing false positives
      ▪ Real-time topology uptime
   ❑ More use cases
      ▪ Multiple time series (correlation)
      ▪ Core metrics
  
Join the Flock

•  We are hiring!
   ❑ https://twitter.com/JoinTheFlock
   ❑ https://twitter.com/jobs
   ❑ Contact us: @arun_kejariwal

Like problem solving? Like challenges? Be at the cutting edge. Make an impact.
  
