Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.

Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra

62.774 Aufrufe

Veröffentlicht am

We present a solution for streaming anomaly detection, named “Coral”, based on Spark, Akka and Cassandra. In the system presented, we run Spark to run the data analytics pipeline for anomaly detection. By running Spark on the latest events and data, we make sure that the model is always up-to-date and that the amount of false positives is kept low, even under changing trends and conditions. Our machine learning pipeline uses Spark decision tree ensembles and k-means clustering. Once the model is trained by Spark, the model’s parameters are pushed to the Streaming Event Processing Layer, implemented in Akka. The Akka layer will then score 1000s of event per seconds according to the last model provided by Spark. Spark and Akka communicate which each other using Cassandra as a low-latency data store. By doing so, we make sure that every element of this solution is resilient and distributed. Spark performs micro-batches to keep the model up-to-date while Akka detects the new anomalies by using the latest Spark-generated data model. The project is currently hosted on Github. Have a look at : http://coral-streaming.github.io

Veröffentlicht in: Daten & Analysen
  • It's genuinely changed my life. I have been sleeping in the spare room for 4 months - and let's just say my sex life had become pretty boring! My wife and I were becoming strangers living in the same house. Thanks to your strategies, I am now back in our bed and the closeness and intimacy have returned. Thank you so much for taking the time to put all this together. It has genuinely changed my life. ♣♣♣ https://bit.ly/37PhtTN
       Antworten 
    Sind Sie sicher, dass Sie …  Ja  Nein
    Ihre Nachricht erscheint hier
  • DOWNLOAD THAT BOOKS INTO AVAILABLE FORMAT (2019 Update) ......................................................................................................................... ......................................................................................................................... Download Full PDF EBOOK here { http://shorturl.at/mzUV6 } ......................................................................................................................... Download Full EPUB Ebook here { http://shorturl.at/mzUV6 } ......................................................................................................................... Download Full doc Ebook here { http://shorturl.at/mzUV6 } ......................................................................................................................... Download PDF EBOOK here { http://shorturl.at/mzUV6 } ......................................................................................................................... Download EPUB Ebook here { http://shorturl.at/mzUV6 } ......................................................................................................................... Download doc Ebook here { http://shorturl.at/mzUV6 } ......................................................................................................................... ......................................................................................................................... ................................................................................................................................... eBook is an electronic version of a traditional print book that can be read by using a personal computer or by using an eBook reader. (An eBook reader can be a software application for use on a computer such as Microsoft's free Reader application, or a book-sized computer that is used solely as a reading device such as Nuvomedia's Rocket eBook.) Users can purchase an eBook on diskette or CD, but the most popular method of getting an eBook is to purchase a downloadable file of the eBook (or other reading material) from a Web site (such as Barnes and Noble) to be read from the user's computer or reading device. Generally, an eBook can be downloaded in five minutes or less ......................................................................................................................... .............. Browse by Genre Available eBooks .............................................................................................................................. Art, Biography, Business, Chick Lit, Children's, Christian, Classics, Comics, Contemporary, Cookbooks, Manga, Memoir, Music, Mystery, Non Fiction, Paranormal, Philosophy, Poetry, Psychology, Religion, Romance, Science, Science Fiction, Self Help, Suspense, Spirituality, Sports, Thriller, Travel, Young Adult, Crime, Ebooks, Fantasy, Fiction, Graphic Novels, Historical Fiction, History, Horror, Humor And Comedy, ......................................................................................................................... ......................................................................................................................... .....BEST SELLER FOR EBOOK RECOMMEND............................................................. ......................................................................................................................... Blowout: Corrupted Democracy, Rogue State Russia, and the Richest, Most Destructive Industry on Earth,-- The Ride of a Lifetime: Lessons Learned from 15 Years as CEO of the Walt Disney Company,-- Call Sign Chaos: Learning to Lead,-- StrengthsFinder 2.0,-- Stillness Is the Key,-- She Said: Breaking the Sexual Harassment Story That Helped Ignite a Movement,-- Atomic Habits: An Easy & Proven Way to Build Good Habits & Break Bad Ones,-- Everything Is Figureoutable,-- What It Takes: Lessons in the Pursuit of Excellence,-- Rich Dad Poor Dad: What the Rich Teach Their Kids About Money That the Poor and Middle Class Do Not!,-- The Total Money Makeover: Classic Edition: A Proven Plan for Financial Fitness,-- Shut Up and Listen!: Hard Business Truths that Will Help You Succeed, ......................................................................................................................... .........................................................................................................................
       Antworten 
    Sind Sie sicher, dass Sie …  Ja  Nein
    Ihre Nachricht erscheint hier
  • Advertisers run several ad campaigns across multiple websites and mobile apps. These ad campaigns' KPIs need to be proactively monitored and optimized to increase their ROI. Hence, we have built our own automated campaign data anomaly detection system using machine learning. This system will help spot data anomalies in campaign performance data for thousands of campaigns dailyAdvertisers run several ad campaigns across multiple websites and mobile apps. These ad campaigns' KPIs need to be proactively monitored and optimized to increase their ROI. Hence, we have built our own automated campaign data anomaly detection system using machine learning. This system will help spot data anomalies in campaign performance data for thousands of campaigns daily http://bit.ly/2N5Z6kh
       Antworten 
    Sind Sie sicher, dass Sie …  Ja  Nein
    Ihre Nachricht erscheint hier
  • DOWNLOAD FULL BOOKS, INTO AVAILABLE FORMAT ......................................................................................................................... ......................................................................................................................... 1.DOWNLOAD FULL. PDF EBOOK here { https://tinyurl.com/yxufevpm } ......................................................................................................................... 1.DOWNLOAD FULL. EPUB Ebook here { https://tinyurl.com/yxufevpm } ......................................................................................................................... 1.DOWNLOAD FULL. doc Ebook here { https://tinyurl.com/yxufevpm } ......................................................................................................................... 1.DOWNLOAD FULL. PDF EBOOK here { https://tinyurl.com/yxufevpm } ......................................................................................................................... 1.DOWNLOAD FULL. EPUB Ebook here { https://tinyurl.com/yxufevpm } ......................................................................................................................... 1.DOWNLOAD FULL. doc Ebook here { https://tinyurl.com/yxufevpm } ......................................................................................................................... ......................................................................................................................... ......................................................................................................................... .............. Browse by Genre Available eBooks ......................................................................................................................... Art, Biography, Business, Chick Lit, Children's, Christian, Classics, Comics, Contemporary, Cookbooks, Crime, Ebooks, Fantasy, Fiction, Graphic Novels, Historical Fiction, History, Horror, Humor And Comedy, Manga, Memoir, Music, Mystery, Non Fiction, Paranormal, Philosophy, Poetry, Psychology, Religion, Romance, Science, Science Fiction, Self Help, Suspense, Spirituality, Sports, Thriller, Travel, Young Adult,
       Antworten 
    Sind Sie sicher, dass Sie …  Ja  Nein
    Ihre Nachricht erscheint hier
  • Hello! High Quality And Affordable Essays For You. Starting at $4.99 per page - Check our website! https://vk.cc/82gJD2
       Antworten 
    Sind Sie sicher, dass Sie …  Ja  Nein
    Ihre Nachricht erscheint hier

Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra

  1. 1. Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra Natalino Busa Data Platform Architect at Ing
  2. 2. ING group http://www.ing.com/About-us/Purpose-Strategy.htm
  3. 3. ING group Empowering people to stay a step ahead in life and in business. http://www.ing.com/About-us/Purpose-Strategy.htm
  4. 4. ING group http://www.ing.com/About-us/Purpose-Strategy.htm Clear and Easy Anytime, Anywhere Empower Keep getting better
  5. 5. Apply advanced, predictive analytics on live data Event-Driven and exposed via APIs Lean Architecture, Easy to integrate Available, Consistent, Streaming, Real-time Data Resilient, Distributed, Scalable, Maintainable Clear and Easy Anytime, Anywhere Empower Keep getting better Data Principles ING group
  6. 6. Big Data and Fast Data population:events,transactions, sessions,customers,etc
  7. 7. Why Fast Data? 1. Relevant up-to-date information. 2. Delivers actionable events.
  8. 8. Why Big Data? 1. Analyze and model 2. Learn, cluster, categorize, organize facts
  9. 9. 10 Real Time APIs Streaming Data Data Sources, Files, DB extracts Batched Data Training, Scoring and Exposing models
  10. 10. 11 Real Time APIs Streaming Data Data Sources, Files, DB extracts Batched Data Training, Scoring and Exposing models
  11. 11. 12 Real Time APIs Streaming Data Data Sources, Files, DB extracts Batched Data Training, Scoring and Exposing models
  12. 12. Cassandra+Akka+Spark: Machine Learning Fast writes 2D Data Structure Replicated Tunable consistency Multi-Data centers C*Akka Spark Very Fast processing Distributed, Scalable computing Actor-based Pipelines Actor state can be persisted Supervision strategies Ad-Hoc Queries Joins, Aggregate User Defined Functions Machine Learning, Advanced Stats and Analytics
  13. 13. Akka-Cassandra-Spark Stack Cassandra-Spark Connector Cassandra Spark Streaming SQL MLlib Graphx Extract Data Create Models, Enrich, Transform Fetch from other Sources: Kafka Fetch from other Sources: DB’s, Files Akka Analytics, Statistics, Data Science, Model Training Access Model Persist Actors’ State
  14. 14. Cassandra-Spark Connector Cassandra: Store all the data Spark: Analyze all the data DC1: replication factor 3 DC2: replication factor 3 DC3: replication factor 3 + Spark Executors Storage! Analytics! Data
  15. 15. Data Science: Anomaly Detection An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. Hawkins, 1980
  16. 16. Data Science: Anomaly Detection Distance Based Density Based
  17. 17. Example: Analyze gowalla check-ins year | month | day | time | uid | lat | lon | ts | vid ------+-------+-----+------+--------+----------+-----------+--------------------------+--------- 2010 | 9 | 14 | 91 | 853 | 40.73474 | -73.87434 | 2010-09-14 00:01:31+0000 | 917955 2010 | 9 | 14 | 328 | 4516 | 40.72585 | -73.99289 | 2010-09-14 00:05:28+0000 | 37160 2010 | 9 | 14 | 344 | 2964 | 40.67621 | -73.98405 | 2010-09-14 00:05:44+0000 | 956870 Check-ins dataset Venues dataset vid | name | lat | long ------+-------+-----+------+--------+----------+----------- +--------------------------+--------- 754108 | My Suit NY | 40.73474 | -73.87434 249755 | UA Court Street Stadium 12 | 40.72585 | -73.99289 6919688 | Sky Asian Bistro | 40.67621 | -73.98405
  18. 18. Data Science: clustering venues
  19. 19. Data Science: clustering venues Weekly visitors patterns! Madison Square, Apple Store, Radio City Music Hall Thursdays, Fridays, Saturdays are busy Statue of Liberty, Jacob K. Javits Convention Center, Whole Foods Market (Columbus Circle) Not popular on midweek Intuition:
  20. 20. Data Science: clustering with k-means Histograms components as dimensions Similar histograms would occupy similar places in the feature space How do I compare histograms: - EMD - Chi-squared distance - Space transformation (DCT) Intuition:
  21. 21. K-Means: Featurize data + cluster val weekly_visits = checkins_venues.select("vid","ts") .map(row => (row.getLong("vid"), vectorize_time(s.getTimestamp("ts")) .reduceByKey(_ + _) .mapValues(_ => featurize_histogram(_._1)) val numClusters = 15 val numIterations = 100 val clusters = KMeans.train(weekly_visits, numClusters, numIterations) PairRDDs, weekly patterns per venue cluster similar weekly patterns
  22. 22. How to use it 1) Classification Classify venues to given groups 2) Anomaly Detection Detect shift in the clustering assignment for a given venue for a given week Keep monitoring weekly change in patterns, when it happens trigger a signal week 26 week 27 Action
  23. 23. Data Science: clustering users’ venues
  24. 24. Data Science: clustering users’ venues Users tend to stick in the same places People have habits By clustering the places together We can identify anomalous locations Size of the cluster matters More points means less anomalous Mini-clusters and single anomalies are treated in similar ways ... Intuition:
  25. 25. Data Science: clustering with DBSCAN DBSCAN find clusters based on neighbouring density Does not require the number of cluster k beforehand. Clusters are not spherical
  26. 26. Data Science: clustering users’ venues val locs = checkins_venues.select("uid", "lat","lon") .map(s => (s.getLong(0), Seq( (s.getDouble(1), s.getDouble(2)) )) .reduceByKey(_ + _) .mapValues( dbscan (_) ) Have a look at: scalanlp/nak
  27. 27. Data Science: Two ways to find anomalies with clustering - Cluster big amount of data with k-means and histograms - Apply clustering independently to million of users, to each identify the patterns with dbscan algorithm
  28. 28. MLlib vs PairRDDs KMeans.train(FeaturesRDD, numClusters, numIterations) UserFeaturesPairRDD.GroupbyKey().mapValues( dbscan(_) ) RDDs map functions Parallelism easy to exploit The function runs locally for each Key Pick your fav machine learning algorithms Limited nr of points Running in parallel for millions of Keys MLlib Truly distributed algorithm Classify venues to given groups Millions of datapoints Limited amount of clusters
  29. 29. 30 Real Time APIs Streaming Data Data Sources, Files, DB extracts Batched Data Training, Scoring and Exposing models
  30. 30. Training vs Scoring: Latency budget ● Akka: millisecond response ● Spark: in-memory data models Train: Spark Score: Spark Train: Spark Score: Akka slow: minutes fast: millisecs Model Scoring ModelTraining slow:minutes
  31. 31. Akka Mixed Load Cassandra Cluster Coral: Web API for dynamic data flows
  32. 32. Akka Web API for dynamic data flows ● a web api to define/manage/run streaming data-flows ● open source and community managed ● event processing as a service coral-streaming/coral Steven Raemaekers Jasper van Zandbeek Ger van Rossum Hoda Alemi Koen Verschuren
  33. 33. 34 Real Time APIs Streaming Data Data Sources, Files, DB extracts Batched Data Summary:
  34. 34. Akka Feedback to the community: More Algorithms for machine learning! - DBSCAN, OPTICS, PAM - More metrics, non-euclidean spaces, etc - Non distributed algorithms: more scalanlp integration? Streaming all the way: Unify batch (Spark) and event streaming (Akka) computing
  35. 35. Thanks! - Vision and strategy on an event-driven bank - ING CIO management team and awesome colleagues Spark, Cassandra, Akka communities !
  36. 36. webinar + live demo: Dec 9th
  37. 37. Resources Coral: event processing webapi https://github.com/coral-streaming/coral Spark + Cassandra: Clustering Events http://www.natalinobusa.com/2015/07/clustering-check-ins-with-spark-and.html Spark: Machine Learning, SQL frames https://spark.apache.org/docs/latest/mllib-guide.html https://spark.apache.org/docs/latest/sql-programming-guide.html Datastax: Analytics and Spark connector http://www.slideshare.net/doanduyhai/spark-cassandra-connector-api-best-practices-and-usecases http://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/anaHome/anaHome.html Anomaly Detection Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A survey"(PDF). ACM Computing Surveys 41 (3): 1. doi:10.1145/1541880.1541882.
  38. 38. Resources Datasets https://snap.stanford.edu/data/loc-gowalla.html E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: Friendship and Mobility: User Movement in Location-Based Social Networks ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2011 https://code.google.com/p/locrec/downloads/detail?name=gowalla-dataset.zip The project is being developed in the context of the SInteliGIS project financed by the Portuguese Foundation for Science and Technology (FCT) through project grant PTDC/EIA-EIA/109840/2009. . Pictures: "DBSCAN-density-data" by Chire - Own work. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DBSCAN-density-data. svg#/media/File:DBSCAN-density-data.svg "DBSCAN-Illustration" by Chire - Own work. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DBSCAN-Illustration.svg#/media/File: DBSCAN-Illustration.svg "Multimodal" by Visnut - Own work. Licensed under CC BY-SA 4.0 via Commons - https://commons.wikimedia.org/wiki/File:Multimodal.png#/media/File:Multimodal.png "Standard deviation diagram" by Mwtoews - Own work, based (in concept) on figure by Jeremy Kemp, on 2005-02-09. Licensed under CC BY 2.5 via Commons - https: //commons.wikimedia.org/wiki/File:Standard_deviation_diagram.svg#/media/File:Standard_deviation_diagram.svg "Michelsonmorley-boxplot" by User:Schutz - Own work. Licensed under Public Domain via Commons - https://commons.wikimedia.org/wiki/File:Michelsonmorley-boxplot. svg#/media/File:Michelsonmorley-boxplot.svg

×