
Realtime Data Analysis Patterns


Talk I gave at Strata+Hadoop in Barcelona on November 21, 2014.

In this talk I discuss our experience with realtime analysis of high-volume event data streams.


  1. Realtime Data Analysis Patterns
     Mikio Braun
     @mikiobraun
     streamdrill & TU Berlin
     O'Reilly Strata+Hadoop, Barcelona
     Nov 21, 2014
     Mikio L. Braun, @mikiobraun Realtime Data Analysis Patterns (c) 2014 by Mikio Braun
  2. How it all started: Realtime Twitter Retweet Trends
     Rails app + PostgreSQL
     About 100 tweets/second, and it got worse
  3. Road from there
     ● Version 1.0: Rails + PostgreSQL
       – store and batch
     ● Version 2.0: Scala + Cassandra
       – stream processing & working data on disk
     ● Version 3.0: streamdrill
       – “in-memory realtime analytics database”
       – approximate algorithms to bound resources
       – moderate parallelism for some things
  4. Lessons learned?
     Not just one kind of realtime.
  5. Applications
     Finance, Gaming, Monitoring, Advertisement, Sensor Networks, Social Media
     Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie
  6. Two Dimensions of Real-Time
     Complexity
     ● counting
     ● trends
     ● outlier detection
     ● recommendation
     ● prediction (churn, etc.)
     Latency
     ● now (ms, RTB)
     ● seconds (fraud)
     ● hours (monitoring)
     ● days (reporting)
  7. What makes realtime hard
     ● Many Events
       – 100 events / second
       – 360k per hour
       – 8.6M per day
       – 260M per month
       – 3.2B per year
     ● Many Objects
     http://www.flickr.com/photos/arenamontanus/269158554/
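The volume figures on this slide follow directly from the event rate. A minimal sketch of the arithmetic, assuming a constant 100 events per second, 30-day months and 365-day years (the slide rounds the results):

```python
# Event volume at a constant rate.
# Assumptions for illustration: 30-day months, 365-day years.
RATE_PER_SECOND = 100

HOUR = 3600          # seconds
DAY = 24 * HOUR


def events_in(seconds: int, rate: int = RATE_PER_SECOND) -> int:
    """Number of events accumulated over `seconds` at `rate` events/second."""
    return rate * seconds


volumes = {
    "hour": events_in(HOUR),
    "day": events_in(DAY),
    "month": events_in(30 * DAY),
    "year": events_in(365 * DAY),
}

for period, count in volumes.items():
    print(f"{period:>5}: {count:,}")
```

Even a modest rate therefore adds up to billions of events per year, which is what makes storing everything and querying it later expensive.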
  8. Classes of Realtime
     ● Events per second (100s? 1000s? 10k?)
     ● Number of objects (A few dozen? Millions?)
     ● Complexity (Counting? Trends?)
     ● Latency (Milliseconds? Hours?)
  9. General Architecture
  10. Data Acquisition
      ● Flat files / HDFS
      ● Apache Flume / Logstash
      ● Apache Kafka for distributed logging
  11. Processing
      ● Depending on Latency: Batch or Streaming
      ● Batch
        – Apache Hadoop
        – Apache Spark
        – Apache Flink
      ● Streaming
        – Apache Storm
        – Apache Samza
  12. Query Layer
      ● Hadoop/Storm/Spark have no query layer
      ● Some database backend, such as Redis, stores the results
  13. Lambda Architecture: Mixing Batch & Streaming
      http://lambda-architecture.net/
  14. Kappa Architecture
      http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
  15. Scaling vs. Approximation
      ● Scaling is expensive
      ● Not all results are relevant
      ● Data changes all the time anyway
      ● Approximate: Trade accuracy for resource usage
  16. Approximation harmful?
  17. Heavy Hitters
      ● Count activities over large item sets (millions, even more, e.g. IP addresses, Twitter users)
      ● Interested in most active elements only.
      Fixed table of counts:
        frank  paul  jan  felix  leo  alex
        15     12    8    5      3    2
      Case 1: element already in database
        paul 12 → paul 13
      Case 2: new element (nico)
        alex 2 → nico 3
      Metwally, Agrawal, Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, International Conference on Database Theory, 2005
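The two cases on the Heavy Hitters slide can be sketched as follows. This is a minimal illustration in the spirit of the Space-Saving scheme from the cited paper, not streamdrill's code: the class name `SpaceSaving`, its methods and the linear scan for the smallest entry are assumptions made for brevity (the paper uses a more efficient data structure).

```python
class SpaceSaving:
    """Fixed-size table of approximate counts (illustrative sketch)."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.counts: dict[str, int] = {}

    def add(self, item: str) -> None:
        if item in self.counts:
            # Case 1: element already in the table, just increment.
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            # Table not yet full: start a new counter.
            self.counts[item] = 1
        else:
            # Case 2: new element and table full. Evict the entry with the
            # smallest count and let the newcomer inherit that count plus one.
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def top(self, k: int) -> list[tuple[str, int]]:
        """The k entries with the highest (approximate) counts."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


# Reproduce the slide's example by seeding the table directly.
hh = SpaceSaving(capacity=6)
hh.counts = {"frank": 15, "paul": 12, "jan": 8, "felix": 5, "leo": 3, "alex": 2}
hh.add("paul")   # case 1: paul goes from 12 to 13
hh.add("nico")   # case 2: alex (2) is replaced by nico (3)
print(hh.top(3))
```

Because a newcomer inherits the evicted count, counts can be overestimated but never underestimated, and memory stays bounded by the table size.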
  18. Count Min Sketch
      ● Summarize histograms over large feature sets
      ● Like Bloom filters, but better
      [Figure: table of counters labeled "m bins" and "n different hash functions", showing the updates for a new entry and a query result of 1. Extracted cell values: 0 0 3 0 1 1 0 2 0 2 0 0 0 3 5 2 0 5 3 2 2 4 5 0 1 3 7 3 0 2 0 8]
      ● Query: Take minimum over all hash functions
      G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004, J. Algorithms 55(1): 58-75 (2005).
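A minimal sketch of the update and query steps described on the Count Min Sketch slide. The class name, the parameters `width` (bins per row) and `depth` (number of hash functions), and the use of salted BLAKE2b as the hash family are all assumptions for illustration; the slide does not prescribe them.

```python
import hashlib


class CountMinSketch:
    """depth rows of width counters, one hash function per row."""

    def __init__(self, width: int, depth: int):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _bins(self, item: str):
        """Yield (row, column) for each of the depth hash functions."""
        data = item.encode("utf-8")
        for row in range(self.depth):
            # A different salt per row gives a different hash function.
            salt = row.to_bytes(8, "little")
            digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        # Update: increment one counter in every row.
        for row, col in self._bins(item):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        # Query: take the minimum over all hash functions.
        return min(self.table[row][col] for row, col in self._bins(item))


cms = CountMinSketch(width=1024, depth=4)
for _ in range(5):
    cms.add("a")
cms.add("b", 2)
print(cms.estimate("a"), cms.estimate("b"))
```

Collisions can only inflate a counter, so the estimate never falls below the true count; taking the minimum picks the row that was disturbed least.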
  19. HyperLogLog
      ● Hash the stream to generate random bit strings
      ● Look for infrequent events
      ● If an event has probability one hundredth → about 100 events should have been seen on average by the time it occurs.
      ● Average to improve the estimate.
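The idea on the HyperLogLog slide can be sketched as below. The "infrequent event" is a long run of leading zero bits in a hash value, and averaging over many registers improves the estimate. This is a simplified illustration, not a production implementation: the class name, the choice of BLAKE2b, the register count `2**p` and the small-range correction are assumptions taken from the common textbook formulation.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter with 2**p registers (illustrative)."""

    def __init__(self, p: int = 12):
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16 in this sketch")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")            # 64 random-looking bits
        index = h >> (64 - self.p)                   # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Position of the leftmost 1-bit in `rest` (1-based); rare when large.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)             # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Few distinct items: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw


hll = HyperLogLog(p=12)
for i in range(50_000):
    hll.add(f"user-{i}")
estimate = hll.count()

# Re-adding known items leaves the registers, and so the estimate, unchanged.
for i in range(1_000):
    hll.add(f"user-{i}")
estimate_after_duplicates = hll.count()
print(round(estimate))
```

With 4,096 registers the expected relative error is on the order of a few percent, while memory stays fixed no matter how many distinct items arrive.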
  20. Comparing Approx. Algorithms
      ● Heavy Hitters:
        – approx. counts + top-k
        – large memory requirement
      ● Count Min Sketch
        – approx. counts for all, but no top-k, no elements
        – needs to know size beforehand
      ● HyperLogLog
        – approx. number of distinct elements
  21. Exponential Decay
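The Exponential Decay slide survives only as a title, so the following is a generic sketch of an exponentially decayed counter and not necessarily the variant used in the talk or in streamdrill. The class name, the `half_life` parameter and the requirement that timestamps arrive in non-decreasing order are assumptions.

```python
import math


class DecayingCounter:
    """Counter whose value halves every `half_life` time units (illustrative)."""

    def __init__(self, half_life: float):
        self.decay_rate = math.log(2) / half_life
        self.value = 0.0
        self.last_update = None  # timestamp of the most recent add()

    def _decayed(self, now: float) -> float:
        """Current value after decaying from the last update to `now`."""
        if self.last_update is None:
            return 0.0
        elapsed = now - self.last_update
        return self.value * math.exp(-self.decay_rate * elapsed)

    def add(self, now: float, weight: float = 1.0) -> None:
        # Decay lazily at update time instead of touching every counter per tick.
        self.value = self._decayed(now) + weight
        self.last_update = now

    def read(self, now: float) -> float:
        return self._decayed(now)


counter = DecayingCounter(half_life=60.0)
counter.add(now=0.0)              # value 1.0 at t=0
half = counter.read(now=60.0)     # one half-life later
counter.add(now=60.0)             # 0.5 decayed value + 1.0 new event
later = counter.read(now=120.0)   # another half-life later
print(half, later)
```

Decayed counts favor recent activity, so a top-k over them behaves like a trend over a sliding time window without storing individual events.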
  22. Beyond Counting
  23. Streamdrill & Demos
      ● Realtime Analysis Solutions
      ● Core Engine:
        – Heavy Hitters + exponential decay + secondary indices
        – Instant counts & top-k results over time windows
        – In-memory
        – Written in Scala
      ● Modules
        – Profiling and Trending
        – Recommendations
        – Count Distinct
  24. Example: Twitter Stock Analysis
      http://play.streamdrill.com/vis/
  25. Example: Twitter Stock Analysis
      ● Trends:
        – symbol:combinations $AAPL:$GOOG
        – symbol:hashtag $AAPL:#trading
        – symbol:keywords $GOOG:disruption
        – symbol:mentions $GOOG:WallStreetCom
        – symbol trend $AAPL
        – symbol:url $FB:http://on.wsj.com/15fHaZW
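To make the trend keys on this slide concrete, here is a hypothetical sketch of how an analyzer might derive them from a tweet's text. The regular expressions, the function name `trend_keys` and the sample tweet are assumptions for illustration; the slides do not show the actual Tweet Analyzer, and keyword and URL extraction are left out.

```python
import re
from itertools import combinations

# Assumed token patterns; real cashtag and hashtag rules are more involved.
SYMBOL = re.compile(r"\$[A-Z]{1,5}\b")
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@(\w+)")


def trend_keys(tweet: str) -> dict[str, list[str]]:
    """Map each trend name to the keys that one tweet contributes to it."""
    symbols = sorted(set(SYMBOL.findall(tweet)))
    hashtags = sorted(set(HASHTAG.findall(tweet)))
    mentions = sorted(set(MENTION.findall(tweet)))
    return {
        "symbol": symbols,
        "symbol:combinations": [f"{a}:{b}" for a, b in combinations(symbols, 2)],
        "symbol:hashtag": [f"{s}:{h}" for s in symbols for h in hashtags],
        "symbol:mentions": [f"{s}:{m}" for s in symbols for m in mentions],
    }


keys = trend_keys("$AAPL and $GOOG rally #trading via @WallStreetCom")
for trend, values in keys.items():
    print(trend, values)
```

Each key would then be counted in its own trend, so that a query for one symbol can pull up the hashtags, mentions and co-occurring symbols that are currently most active.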
  26. Example: Twitter Stock Analysis
  27. Example: Twitter Stock Analysis
  28. Example: Twitter Stock Analysis
      [Diagram with the components Twitter, Tweet Analyzer, streamdrill and JavaScript, and the connection labels "tweets", "updates" and "via REST"]
  29. Realtime User Profiles
  30. Realtime User Profiles
  31. Realtime User Profiles
  32. Realtime User Profiles
  33. Realtime user profiles
      ● Process 10k events / second on one machine
      ● Track about 1 Million counts per 1 GB
      ● Shard by user for higher accuracy
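One way to read "shard by user" is to route all events of a user to the same shard by hashing the user ID, so that each shard's bounded table has to cover fewer users. The sketch below illustrates only the routing; it uses an exact `Counter` per shard as a stand-in for a bounded approximate table, and the names `shard_for` and `ShardedProfiles` are assumptions.

```python
import zlib
from collections import Counter


def shard_for(user_id: str, num_shards: int) -> int:
    """Stable shard index for a user (CRC32 is stable across runs)."""
    return zlib.crc32(user_id.encode("utf-8")) % num_shards


class ShardedProfiles:
    """Per-user activity counts, partitioned by user (illustrative)."""

    def __init__(self, num_shards: int):
        # Stand-in: an exact Counter per shard instead of a bounded table.
        self.shards = [Counter() for _ in range(num_shards)]

    def record(self, user_id: str, item: str) -> None:
        shard = self.shards[shard_for(user_id, len(self.shards))]
        shard[(user_id, item)] += 1

    def profile(self, user_id: str) -> dict[str, int]:
        """All counts for one user; only that user's shard is consulted."""
        shard = self.shards[shard_for(user_id, len(self.shards))]
        return {item: n for (uid, item), n in shard.items() if uid == user_id}


profiles = ShardedProfiles(num_shards=4)
profiles.record("alice", "sports")
profiles.record("alice", "sports")
profiles.record("alice", "music")
profiles.record("bob", "sports")
print(profiles.profile("alice"))
```

Because a user's events never spread across shards, a profile query touches a single shard, and shards can live on separate machines.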
  34. Realtime Data Analysis Patterns
      ● Acquisition / Processing / Query Layer
      ● Acquisition: Flat files and distributed logs
      ● Processing: Scaling batch or streaming
      ● Query Layer: Separate query from processing
      ● Lambda and Kappa Architecture
      ● Approximation as alternative to scaling
      ● Trends with indices as building blocks for data analysis
  35. Thank You
      Mikio Braun
      mikio@streamdrill.com
      @mikiobraun
