Realtime Data Analysis Patterns

Talk I gave at Strata+Hadoop in Barcelona on November 21, 2014.

In this talk I discuss the experience we gained with realtime analysis of high-volume event data streams.

Published in: Internet

  1. Realtime Data Analysis Patterns. Mikio Braun, @mikiobraun, streamdrill & TU Berlin. O'Reilly Strata+Hadoop, Barcelona, Nov 21, 2014. Mikio L. Braun, @mikiobraun Realtime Data Analysis Patterns (c) 2014 by Mikio Braun
  2. How it all started: Realtime Twitter Retweet Trends. Rails app + PostgreSQL. About 100 tweets/second, and it got worse.
  3. Road from there ● Version 1.0: Rails + PostgreSQL – store and batch ● Version 2.0: Scala + Cassandra – stream processing & working data on disk ● Version 3.0: streamdrill – “in-memory realtime analytics database” – approximate algorithms to bound resources – moderate parallelism for some things
  4. Lessons learned? Not just one kind of realtime.
  5. Applications: Finance, Gaming, Monitoring, Advertisement, Sensor Networks, Social Media. Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie
  6. Two Dimensions of Real-Time. Complexity: ● counting ● trends ● outlier detection ● recommendation ● prediction (churn, etc.) Latency: ● now (ms, RTB) ● seconds (fraud) ● hours (monitoring) ● days (reporting)
  7. What makes realtime hard ● Many Events – 100 events / second – 360k per hour – 8.6M per day – 260M per month – 3.2B per year ● Many Objects http://www.flickr.com/photos/arenamontanus/269158554/
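The volumes on this slide follow directly from the rate of 100 events per second. A quick check, assuming a 30-day month and a 365-day year (both are assumptions; the slide does not state which calendar lengths were used):

```python
RATE_PER_SECOND = 100  # rate quoted on the slide

HOUR = 60 * 60      # seconds per hour
DAY = 24 * HOUR     # seconds per day


def events_in(seconds, rate=RATE_PER_SECOND):
    """Number of events that arrive in the given number of seconds."""
    return rate * seconds


volumes = {
    "hour": events_in(HOUR),
    "day": events_in(DAY),
    "month": events_in(30 * DAY),   # assumed 30-day month
    "year": events_in(365 * DAY),   # assumed 365-day year
}

for period, count in volumes.items():
    print(f"{period}: {count:,}")
```

The exact values (360,000, 8,640,000, 259,200,000 and 3,153,600,000) round to the figures on the slide.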
  8. Classes of Realtime ● Events per second (100s? 1000s? 10k?) ● Number of objects (A few dozen? Millions?) ● Complexity (Counting? Trends?) ● Latency (Milliseconds? Hours?)
  9. General Architecture
  10. Data Acquisition ● Flat files / HDFS ● Apache Flume / Logstash ● Apache Kafka for distributed logging
  11. Processing ● Depending on Latency: Batch or Streaming ● Batch – Apache Hadoop – Apache Spark – Apache Flink ● Streaming – Apache Storm – Apache Samza
  12. Query Layer ● Hadoop/Storm/Spark have no query layer ● Use a database backend such as Redis to store the results
  13. Lambda Architecture: Mixing Batch & Streaming http://lambda-architecture.net/
  14. Kappa Architecture http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
  15. Scaling vs. Approximation ● Scaling is expensive ● Not all results are relevant ● Data changes all the time anyway ● Approximate: Trade accuracy for resource usage
  16. Approximation harmful?
  17. Heavy Hitters ● Count activities over large item sets (millions or more, e.g. IP addresses, Twitter users) ● Interested in the most active elements only ● Fixed table of counts: frank 15, paul 12, jan 8, felix 5, leo 3, alex 2 ● Case 1: element already in the table: paul 12 becomes paul 13 ● Case 2: new element: nico replaces alex 2 and becomes nico 3. Metwally, Agrawal, Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, International Conference on Database Theory, 2005
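The two update cases on this slide can be sketched as follows. This is a minimal, unoptimized sketch of the fixed-table idea from the cited paper, not streamdrill's implementation. The class name `SpaceSaving` and its methods are hypothetical, and the linear scan for the smallest entry stands in for the more efficient data structure a real system would use.

```python
class SpaceSaving:
    """Fixed-size table of approximate counts for the most active items."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}

    def add(self, item, increment=1):
        if item in self.counts:
            # Case 1: the item is already tracked, so just count it.
            self.counts[item] += increment
        elif len(self.counts) < self.capacity:
            # The table is not full yet, so start tracking the item.
            self.counts[item] = increment
        else:
            # Case 2: evict the smallest entry; the new item inherits its count.
            victim = min(self.counts, key=self.counts.get)
            inherited = self.counts.pop(victim)
            self.counts[item] = inherited + increment

    def top(self, k):
        """Return the k largest (item, count) pairs, largest first."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


# Reproduce the example from the slide.
table = SpaceSaving(capacity=6)
table.counts = {"frank": 15, "paul": 12, "jan": 8, "felix": 5, "leo": 3, "alex": 2}
table.add("paul")   # case 1: paul goes from 12 to 13
table.add("nico")   # case 2: nico replaces alex and gets 2 + 1 = 3
print(table.top(3))  # [('frank', 15), ('paul', 13), ('jan', 8)]
```

Because a new item inherits the evicted count, counts can be overestimated but never underestimated, and memory stays bounded by `capacity`.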
  18. Count Min Sketch ● Summarize histograms over large feature sets ● Like Bloom filters, but better ● Query: Take minimum over all hash functions. [Figure: table of counters with m bins per row and one row for each of n different hash functions; an update for a new entry touches one bin per row; example query result: 1] G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004, J. Algorithms 55(1): 58-75 (2005).
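A minimal sketch of the update and query steps described on this slide. The class name `CountMinSketch`, the parameter names `width` (m bins) and `depth` (n hash functions), and the use of SHA-256 with a row prefix to derive the hash functions are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """depth rows of width counters; each row uses its own hash function."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _bin(self, item, row):
        # Derive one hash function per row by prefixing the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, increment=1):
        # Update: increment one bin in every row.
        for row in range(self.depth):
            self.table[row][self._bin(item, row)] += increment

    def query(self, item):
        # Query: take the minimum over all hash functions.
        return min(
            self.table[row][self._bin(item, row)] for row in range(self.depth)
        )


sketch = CountMinSketch(width=64, depth=4)
for _ in range(5):
    sketch.add("paul")
for _ in range(2):
    sketch.add("nico")
print(sketch.query("paul"))  # at least 5; collisions can only inflate a count
```

Taking the minimum limits the damage from collisions: a count is only wrong if the item collides with others in every row.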
  19. HyperLogLog ● Hash stream to generate random bit strings ● Look for infrequent events ● If an event has probability one hundredth, you should have seen about 100 items on average when it occurs ● Average over several estimates to improve the estimate
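A simplified sketch of the idea on this slide. The "infrequent event" is a long run of leading zero bits in an item's hash, and the estimates of many registers are averaged. The class name `HyperLogLog`, the parameter `p`, and the use of SHA-1 as the hash are assumptions; the bias constant and the small-range correction follow the commonly published form of the algorithm, and production implementations add further corrections.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct count using 2**p small registers."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        index = h >> (64 - self.p)                   # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank: position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)        # bias constant, m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros > 0:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog(p=10)
for i in range(10000):
    hll.add(f"user-{i}")
print(round(hll.estimate()))  # close to 10000; typical error is a few percent
```

Adding the same item again never changes a register, which is why the structure counts distinct elements rather than events.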
  20. Comparing Approx. Algorithms ● Heavy Hitters: – approx. counts + top-k – large memory requirement ● Count Min Sketch – approx. counts for all, but no top-k, no elements – needs to know size beforehand ● HyperLogLog – approx. number of distinct elements
  21. Exponential Decay
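The transcript preserves only the title of this slide. As an illustration of what an exponentially decaying counter can look like, here is a minimal sketch; the class name `DecayingCounter`, the `half_life` parameter, and the lazy decay-on-access scheme are assumptions, not a description of the approach shown in the talk. Timestamps are assumed to be non-decreasing.

```python
class DecayingCounter:
    """Counter whose value halves every half_life time units."""

    def __init__(self, half_life):
        self.half_life = half_life
        self.value = 0.0
        self.last_update = 0.0

    def _decay_factor(self, now):
        # Fraction of the stored value that survives until `now`.
        return 0.5 ** ((now - self.last_update) / self.half_life)

    def add(self, now, increment=1.0):
        # Decay the stored value up to `now`, then count the new event.
        self.value = self.value * self._decay_factor(now) + increment
        self.last_update = now

    def read(self, now):
        # Reading does not modify the stored state.
        return self.value * self._decay_factor(now)


counter = DecayingCounter(half_life=10)
counter.add(now=0)
print(counter.read(now=10))  # 0.5
counter.add(now=10)
print(counter.read(now=20))  # 0.75
```

Decaying lazily on access means idle counters cost nothing, and recent activity outweighs old activity without keeping a window of raw events.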
  22. Beyond Counting
  23. Streamdrill & Demos ● Realtime Analysis Solutions ● Core Engine: – Heavy Hitters + exponential decay + secondary indices – Instant counts & top-k results over time windows – In-memory – Written in Scala ● Modules – Profiling and Trending – Recommendations – Count Distinct
  24. Example: Twitter Stock Analysis http://play.streamdrill.com/vis/
  25. Example: Twitter Stock Analysis ● Trends: – symbol:combinations $AAPL:$GOOG – symbol:hashtag $AAPL:#trading – symbol:keywords $GOOG:disruption – symbol:mentions $GOOG:WallStreetCom – symbol trend $AAPL – symbol:url $FB:http://on.wsj.com/15fHaZW
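To make the trend keys on this slide concrete, here is a hedged sketch of how such keys could be derived from the text of a tweet. The function `trend_keys`, the regular expressions, and the sample tweet are hypothetical and are not taken from the actual Tweet Analyzer. The symbol:keywords trend is left out because it would need a keyword extraction step that the slides do not describe.

```python
import re
from itertools import combinations

SYMBOL = re.compile(r"\$[A-Z]{1,5}\b")   # stock symbols such as $AAPL
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@(\w+)")
URL = re.compile(r"https?://\S+")


def trend_keys(text):
    """Build trend keys, grouped by trend name, for a single tweet."""
    urls = URL.findall(text)
    rest = URL.sub(" ", text)            # keep URL fragments out of other matches
    symbols = sorted(set(SYMBOL.findall(rest)))
    hashtags = HASHTAG.findall(rest)
    mentions = MENTION.findall(rest)
    return {
        "symbol": symbols,
        "symbol:combinations": [f"{a}:{b}" for a, b in combinations(symbols, 2)],
        "symbol:hashtag": [f"{s}:{h}" for s in symbols for h in hashtags],
        "symbol:mentions": [f"{s}:{m}" for s in symbols for m in mentions],
        "symbol:url": [f"{s}:{u}" for s in symbols for u in urls],
    }


tweet = "$AAPL and $GOOG up, says @WallStreetCom #trading http://on.wsj.com/15fHaZW"
keys = trend_keys(tweet)
print(keys["symbol:combinations"])  # ['$AAPL:$GOOG']
print(keys["symbol:hashtag"])       # ['$AAPL:#trading', '$GOOG:#trading']
```

Each resulting key would then be counted in its own trend, so that a query such as "top hashtags for $AAPL" becomes a prefix lookup.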
  26. Example: Twitter Stock Analysis
  27. Example: Twitter Stock Analysis
  28. Example: Twitter Stock Analysis. [Diagram with the components Twitter, Tweet Analyzer, streamdrill, and JavaScript via REST; arrows labeled "tweets" and "updates"]
  29. Realtime User Profiles
  30. Realtime User Profiles
  31. Realtime User Profiles
  32. Realtime User Profiles
  33. Realtime user profiles ● Process 10k events / second on one machine ● Track about 1 Million counts per 1 GB ● Shard by user for higher accuracy
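Sharding by user can be sketched as below. The function names `shard_for` and `route`, the choice of MD5 as a stable hash, and the (user, item) event format are assumptions for illustration. The point is that all events of one user land on the same shard, so each shard's bounded table only has to cover a fraction of the users.

```python
import hashlib
from collections import Counter


def shard_for(user_id, num_shards):
    """Map a user to a shard with a hash that is stable across processes."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def route(events, num_shards):
    """Count (user, item) events in per-shard tables."""
    shards = [Counter() for _ in range(num_shards)]
    for user_id, item in events:
        shards[shard_for(user_id, num_shards)][(user_id, item)] += 1
    return shards


events = [
    ("alice", "sports"), ("bob", "tech"), ("alice", "sports"),
    ("carol", "tech"), ("alice", "finance"), ("bob", "tech"),
]
shards = route(events, num_shards=4)
print(shard_for("alice", 4) == shard_for("alice", 4))  # True: routing is stable
```

Here plain `Counter` objects stand in for the bounded tables; in a real deployment each shard would hold its own fixed-size structure.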
  34. Realtime Data Analysis Patterns ● Acquisition / Processing / Query Layer ● Acquisition: Flat files and distributed logs ● Processing: Scaling batch or streaming ● Query Layer: Separate query from processing ● Lambda and Kappa Architecture ● Approximation as alternative to scaling ● Trends with indices as building blocks for data analysis
  35. Thank You. Mikio Braun, mikio@streamdrill.com, @mikiobraun