
Interactive real-time dashboards on data streams using Kafka, Druid, and Superset

When users interact with analytics dashboards, two key requirements for a smooth experience are quick response time and data freshness. Organizations that want to build fast, interactive BI dashboards over streaming data often struggle to select a proper serving layer.
Cluster computing frameworks such as Hadoop or Spark work well for storing large volumes of data, but they are not optimized for making that data available for queries in real time. Their long query latencies also make them suboptimal choices for powering interactive dashboards and BI use cases.
This talk presents an open-source real-time data analytics stack using Apache Kafka, Druid, and Superset. The stack combines the low-latency streaming and processing capabilities of Kafka with Druid, which enables immediate exploration and provides low-latency queries over the ingested data streams; Superset provides the visualization and dashboarding layer, which integrates nicely with Druid. In this talk we discuss why this architecture is well suited to interactive applications over streaming data, present an end-to-end demo of the complete stack, walk through its key features, and review performance characteristics from real-world use cases. NISHANT BANGARWA, Software Engineer, Hortonworks

Published in: Technology


  1. © Hortonworks Inc. 2011–2018. All rights reserved. Interactive Realtime Dashboards on Data Streams. Nishant Bangarwa, Hortonworks; Druid Committer, PMC
  2. Sample Data Stream: Wikipedia Edits
  3. Demo: Wikipedia Real-Time Dashboard (Accelerated 30x)
  4. Step-by-Step Breakdown: Consume Events → Enrich/Transform (add geolocation from IP address) → Store Events → Visualize Events. Sample event: [[Eoghan Harris]] https://en.wikipedia.org/w/index.php?diff=792474242&oldid=787592607 * 7.114.169.238 * (+167) Added fact
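The raw Wikipedia feed line on the slide can be split into structured fields before the later steps. A minimal parsing sketch (my illustration, not the speaker's actual code; the field names `page`, `url`, `user`, `delta`, `comment` are assumptions):

```python
import re

# Parse one raw Wikipedia edit event, as shown on the slide, into named
# fields: [[page]] diff-url * editor * (+bytes) comment
EVENT_RE = re.compile(
    r"\[\[(?P<page>[^\]]+)\]\]\s+"   # page title in [[...]]
    r"(?P<url>\S+)\s+\*\s+"          # diff URL
    r"(?P<user>\S+)\s+\*\s+"         # editor (IP address or username)
    r"\((?P<delta>[+-]\d+)\)\s+"     # bytes added/removed
    r"(?P<comment>.*)"               # edit summary
)

def parse_edit(line: str) -> dict:
    m = EVENT_RE.match(line)
    if m is None:
        raise ValueError(f"unparseable event: {line!r}")
    event = m.groupdict()
    event["delta"] = int(event["delta"])  # "+167" -> 167
    return event

sample = ("[[Eoghan Harris]] "
          "https://en.wikipedia.org/w/index.php?diff=792474242&oldid=787592607"
          " * 7.114.169.238 * (+167) Added fact")
parsed = parse_edit(sample)
```

Once parsed, the `user` field (here an IP address) is what the enrichment step turns into a geolocation.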
  5. Required Components: Event Flow, Event Processing, Data Store, Visualization Layer
  6. Event Flow
  7. Event Flow Requirements (Event Producers → Queue → Event Consumers): low latency; high throughput; failure handling; message delivery guarantees (at-most once, at-least once, or exactly once; the choice varies by use case, with exactly once being the holy grail and the most difficult to achieve); scalability
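The difference between the delivery guarantees comes down to when the consumer commits its offset relative to processing. A toy simulation of that trade-off (purely illustrative; not any real Kafka API):

```python
# Committing the offset *before* processing gives at-most-once: a crash
# between commit and processing loses the event. Committing *after*
# processing gives at-least-once: a crash between processing and commit
# causes the event to be replayed on restart.

def run_consumer(log, start, commit_before, crash_at=None):
    """Consume `log` from offset `start`; optionally crash while handling
    offset `crash_at` (after the first of the two actions for it)."""
    committed, processed = start, []
    while committed < len(log):
        offset = committed
        if commit_before:
            committed = offset + 1           # at-most-once: commit first
            if offset == crash_at:
                return processed, committed  # crash before processing: lost
            processed.append(log[offset])
        else:
            processed.append(log[offset])    # at-least-once: process first
            if offset == crash_at:
                return processed, committed  # crash before commit: replayed
            committed = offset + 1
    return processed, committed

def delivered(log, commit_before, crash_at):
    """Events processed across a crash at `crash_at` plus a clean restart."""
    first, committed = run_consumer(log, 0, commit_before, crash_at)
    resumed, _ = run_consumer(log, committed, commit_before)
    return first + resumed
```

With a crash while handling the second event, the at-most-once consumer drops it, while the at-least-once consumer processes it twice; exactly once requires coordinating processing and offset commits atomically.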
  8. Apache Kafka
  9. Apache Kafka: low latency; high throughput; message delivery guarantees (at-least once, and exactly once, fully introduced in Apache Kafka v0.11.0, June 2017); reliable design to handle failures (message acks between producers and brokers, data replication on brokers, consumers can read from any desired offset); handles multiple producers/consumers; scalable
  10. Event Processing
  11. Event Processing Requirements (Consume → Process → Produce): consume-process-produce pattern; enrich and transform event streams; windowing; apply business logic; consume and join multiple streams into one; failure handling; scalability
  12. Kafka Streams: a rich, lightweight stream-processing library; event-at-a-time processing; stateful processing (windowing, joining, and aggregation operators); local state using RocksDB, backed by a changelog in Kafka; highly scalable, distributed, and fault tolerant. Compared to a standard Kafka consumer it is higher level, so a sophisticated app is faster to build, but it offers less control for very fine-grained consumption
  13. Kafka Streams: Wikipedia Data Enrichment
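The enrichment step of the demo maps the editor's IP address to a geolocation. The talk implements this with Kafka Streams in Java; the sketch below shows the same transform in Python, with a made-up two-entry lookup table (`GEO_RANGES`) standing in for a real GeoIP database such as MaxMind:

```python
import ipaddress

# Hypothetical IP-range-to-location table; a real deployment would use a
# GeoIP database, not this hard-coded stand-in.
GEO_RANGES = [
    (ipaddress.ip_network("7.0.0.0/8"), ("USA", "SF")),
    (ipaddress.ip_network("81.0.0.0/8"), ("Germany", "Berlin")),
]

def enrich(event: dict) -> dict:
    """Return a copy of the event with country/city added from its IP."""
    ip = ipaddress.ip_address(event["user"])
    for net, (country, city) in GEO_RANGES:
        if ip in net:
            return {**event, "country": country, "city": city}
    return {**event, "country": "unknown", "city": "unknown"}
```

In the consume-process-produce pattern, this function is the "process" stage: each event read from the raw topic is enriched and written to the enriched topic.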
  14. Data Store
  15. Data Store Requirements (Processed Events → Data Store → Queries): ability to ingest streaming data; power interactive dashboards; sub-second query response time; ad-hoc arbitrary slicing and dicing of data; data freshness; queries hit summarized/aggregated data; scalability; high availability
  16. Druid: column-oriented distributed datastore; sub-second query times; real-time streaming ingestion; arbitrary slicing and dicing of data; automatic data summarization; approximate algorithms (HyperLogLog, theta sketches); scalable to petabytes of data; highly available
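For a flavor of how dashboards slice this data, here is a representative Druid native topN query; the datasource, interval, and field names are assumptions matching the example Wikipedia dataset used later in the deck, not a query from the talk:

```json
{
  "queryType": "topN",
  "dataSource": "wikipedia",
  "intervals": ["2011-01-01/2011-01-03"],
  "granularity": "all",
  "dimension": "page",
  "metric": "edits",
  "threshold": 5,
  "aggregations": [
    {"type": "longSum", "name": "edits", "fieldName": "count"},
    {"type": "longSum", "name": "added", "fieldName": "added"}
  ]
}
```

This asks for the five most-edited pages over two days; it is exactly the kind of grouped, filtered aggregation that Druid's column orientation and indexes make sub-second.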
  17. Suitable Use Cases: powering interactive user-facing applications; arbitrary slicing and dicing of large datasets; user behavior analysis (measuring distinct counts, retention analysis, funnel analysis, A/B testing); exploratory analytics and root-cause analysis; not suited to queries that dump the entire dataset
  18. Druid Segments: data in Druid is stored in segment files, partitioned by time. Ideally each segment file is smaller than 1 GB; if files grow larger, smaller time partitions are needed. (Timeline illustration: Segment 1: Monday, Segment 2: Tuesday, Segment 3: Wednesday, Segment 4: Thursday, Segments 5_1 and 5_2: Friday)
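Time partitioning itself is simple to picture: every event falls into the segment interval that contains its timestamp. A minimal sketch of day-granularity partitioning (my illustration; Druid's actual segment assignment is far more involved):

```python
from collections import defaultdict

def partition_by_day(events):
    """Group (iso_timestamp, payload) pairs into per-day 'segments',
    mirroring how Druid partitions segment files by time interval."""
    segments = defaultdict(list)
    for ts, payload in events:
        day = ts[:10]  # ISO timestamps start with YYYY-MM-DD
        segments[day].append(payload)
    return dict(segments)

events = [
    ("2011-01-01T00:01:35Z", "edit-1"),
    ("2011-01-01T00:05:35Z", "edit-2"),
    ("2011-01-02T00:08:35Z", "edit-3"),
]
segments = partition_by_day(events)
```

Because queries almost always carry a time filter, this layout lets Druid touch only the segments whose interval overlaps the query.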
  19. Example Wikipedia Edit Dataset. Columns: timestamp | page | language | city | country | … | added | deleted, grouped as Timestamp, Dimensions, Metrics. Rows: 2011-01-01T00:01:35Z, Justin Bieber, en, SF, USA, 10, 65; 2011-01-01T00:03:63Z, Justin Bieber, en, SF, USA, 15, 62; 2011-01-01T00:04:51Z, Justin Bieber, en, SF, USA, 32, 45; 2011-01-01T00:05:35Z, Ke$ha, en, Calgary, CA, 17, 87; 2011-01-01T00:06:41Z, Ke$ha, en, Calgary, CA, 43, 99; 2011-01-02T00:08:35Z, Selena Gomes, en, Calgary, CA, 12, 53
  20. Data Rollup (rollup by hour). The raw rows above collapse to columns timestamp | page | language | city | country | count | sum_added | sum_deleted | min_added | max_added | …. Rows: 2011-01-01T00:00:00Z, Justin Bieber, en, SF, USA, 3, 57, 172, 10, 32; 2011-01-01T00:00:00Z, Ke$ha, en, Calgary, CA, 2, 60, 186, 17, 43; 2011-01-02T00:00:00Z, Selena Gomes, en, Calgary, CA, 1, 12, 53, 12, 12
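The hourly rollup on the slide can be sketched directly: group rows by the hour-truncated timestamp plus all dimensions, then pre-aggregate the metrics (an illustration of the concept, not Druid's implementation):

```python
from collections import defaultdict

def rollup_by_hour(rows):
    """Collapse raw rows into one row per (hour, dimensions) group with
    count/sum/min/max metrics, as in Druid's ingestion-time rollup."""
    groups = defaultdict(list)
    for r in rows:
        hour = r["timestamp"][:13] + ":00:00Z"  # truncate to the hour
        key = (hour, r["page"], r["language"], r["city"], r["country"])
        groups[key].append(r)
    out = []
    for (hour, page, lang, city, country), rs in groups.items():
        added = [r["added"] for r in rs]
        deleted = [r["deleted"] for r in rs]
        out.append({
            "timestamp": hour, "page": page, "language": lang,
            "city": city, "country": country, "count": len(rs),
            "sum_added": sum(added), "sum_deleted": sum(deleted),
            "min_added": min(added), "max_added": max(added),
        })
    return out

# The raw rows from the example dataset (slide 19)
rows = [
    {"timestamp": "2011-01-01T00:01:35Z", "page": "Justin Bieber", "language": "en", "city": "SF", "country": "USA", "added": 10, "deleted": 65},
    {"timestamp": "2011-01-01T00:03:63Z", "page": "Justin Bieber", "language": "en", "city": "SF", "country": "USA", "added": 15, "deleted": 62},
    {"timestamp": "2011-01-01T00:04:51Z", "page": "Justin Bieber", "language": "en", "city": "SF", "country": "USA", "added": 32, "deleted": 45},
    {"timestamp": "2011-01-01T00:05:35Z", "page": "Ke$ha", "language": "en", "city": "Calgary", "country": "CA", "added": 17, "deleted": 87},
    {"timestamp": "2011-01-01T00:06:41Z", "page": "Ke$ha", "language": "en", "city": "Calgary", "country": "CA", "added": 43, "deleted": 99},
    {"timestamp": "2011-01-02T00:08:35Z", "page": "Selena Gomes", "language": "en", "city": "Calgary", "country": "CA", "added": 12, "deleted": 53},
]
rolled = rollup_by_hour(rows)
```

Running this on the six sample rows reproduces the three rolled-up rows shown on the slide.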
  21. Dictionary Encoding: create and store integer ids for each value. For the page column, the values Justin Bieber, Ke$ha, Selena Gomes encode as Justin Bieber: 0, Ke$ha: 1, Selena Gomes: 2, so the column data becomes [0 0 0 1 1 2]; the city column becomes [0 0 0 1 1 1]. (Same sample table as slide 19.)
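Dictionary encoding as described above fits in a few lines (a conceptual sketch; Druid's on-disk format adds compression on top):

```python
def dictionary_encode(values):
    """Assign an integer id to each distinct value in first-seen order;
    return (value -> id mapping, encoded column of ids)."""
    ids, column = {}, []
    for v in values:
        if v not in ids:
            ids[v] = len(ids)  # next unused id
        column.append(ids[v])
    return ids, column
```

Applied to the page column of the sample table, it yields exactly the mapping and encoded column on the slide; storing small integers instead of repeated strings is what makes the column compact and scan-friendly.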
  22. Bitmap Indices: store a bitmap index for each value. Justin Bieber -> rows [0, 1, 2] -> [1 1 1 0 0 0]; Ke$ha -> rows [3, 4] -> [0 0 0 1 1 0]; Selena Gomes -> row [5] -> [0 0 0 0 0 1]. Queries: Justin Bieber or Ke$ha -> [1 1 1 0 0 0] OR [0 0 0 1 1 0] -> [1 1 1 1 1 0]; language = en and country = CA -> [1 1 1 1 1 1] AND [0 0 0 1 1 1] -> [0 0 0 1 1 1]. Indexes are compressed with Concise or Roaring encoding. (Same sample table as slide 19.)
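The bitmap construction and boolean query evaluation above can be sketched as plain bit lists (Druid actually stores Concise- or Roaring-compressed bitmaps, not Python lists):

```python
def build_bitmaps(values):
    """One bitmap (list of 0/1, one bit per row) per distinct value."""
    n = len(values)
    bitmaps = {}
    for i, v in enumerate(values):
        bitmaps.setdefault(v, [0] * n)[i] = 1
    return bitmaps

def bm_or(a, b):
    return [x | y for x, y in zip(a, b)]   # rows matching either filter

def bm_and(a, b):
    return [x & y for x, y in zip(a, b)]   # rows matching both filters
```

Filters become cheap bitwise operations: OR for "Justin Bieber or Ke$ha", AND for "language = en and country = CA", with the result bitmap telling the scan exactly which rows to read.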
  23. Approximate Sketch Columns: the sample table gains a userid dimension (user1111111, user1111111, user2222222, user3333333, user4444444, user1111111). Rolling up by hour replaces the raw userids with a userid_sketch column: timestamp | page | language | city | country | count | sum_added | sum_deleted | min_added | userid_sketch | …. Rows: 2011-01-01T00:00:00Z, Justin Bieber, en, SF, USA, 3, 57, 172, 10, {sketch}; 2011-01-01T00:00:00Z, Ke$ha, en, Calgary, CA, 2, 60, 186, 17, {sketch}; 2011-01-02T00:00:00Z, Selena Gomes, en, Calgary, CA, 1, 12, 53, 12, {sketch}
  24. Approximate Algorithms: store sketch objects instead of raw column values, giving better rollup for high-cardinality columns (e.g. userid) and reduced storage size. Use cases: fast approximate distinct counts, approximate histograms, funnel/retention analysis. Limitations: exact counts are not possible, and you cannot filter on individual row values
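To make the sketch idea concrete, here is a tiny K-Minimum-Values distinct-count sketch (my simplified illustration; Druid uses HyperLogLog and theta sketches from the DataSketches library). The key property is that sketches are small and mergeable, so rolled-up rows can still answer distinct-count queries across arbitrary groups:

```python
import hashlib

K = 64  # sketch size; exact while the distinct count stays below K

def _h(value: str) -> float:
    """Hash a value to a uniform number in [0, 1)."""
    digest = hashlib.sha256(value.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def make_sketch(values):
    """Keep only the K smallest distinct hash values."""
    return sorted({_h(v) for v in values})[:K]

def merge(s1, s2):
    """Merging two sketches = union of hashes, re-truncated to K."""
    return sorted(set(s1) | set(s2))[:K]

def estimate(sketch):
    if len(sketch) < K:
        return len(sketch)                 # below capacity: exact
    return round((K - 1) / sketch[-1])     # KMV estimator from Kth minimum
```

Using the userids from the sketch-column slide: the per-hour sketches for Justin Bieber ({user1111111, user2222222}), Ke$ha ({user3333333, user4444444}), and Selena Gomes ({user1111111}) merge to a distinct-user estimate of 4, with duplicates across rows counted once.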
  25. Druid Architecture: streaming data is ingested by realtime index tasks, which hand segments off to historical nodes; batch data is loaded directly onto historical nodes; broker nodes route queries across the realtime index tasks and historical nodes
  26. Performance and Scalability, Fast Facts: most events per day, 300 billion events/day (Metamarkets); most computed metrics, 1 billion metrics/min (Jolata); largest cluster, 200 nodes (Metamarkets); largest hourly ingestion, 2 TB per hour (Netflix)
  27. Companies Using Druid
  28. Visualization Layer
  29. Visualization Layer Requirements (Data Store → Visualization Layer → User Dashboards): rich dashboarding capabilities; works with multiple datasources; security/access control; allows for extension with custom visualizations
  30. Superset: Python backend (Flask App Builder with authentication, Pandas for rich analytics, SQLAlchemy as the SQL toolkit); JavaScript frontend (React, NVD3); deep integration with Druid
  31. Superset Rich Dashboarding Capabilities: Treemaps
  32. Superset Rich Dashboarding Capabilities: Sunburst
  33. Superset UI Provides Powerful Visualizations. Rich library of dashboard visualizations. Basic: bar charts, pie charts, line charts. Advanced: Sankey diagrams, treemaps, sunburst, heatmaps. And more!
  34. Wikipedia Real-Time Dashboard: Kafka Connect → wikipedia-raw topic → IP-to-Geolocation Processor → wikipedia-enriched topic
  35. Project Websites: Kafka - http://kafka.apache.org; Druid - http://druid.io; Superset - http://superset.incubator.apache.org
  36. Thank you. Twitter - @NishantBangarwa; Email - nbangarwa@hortonworks.com; LinkedIn - https://www.linkedin.com/in/nishant-bangarwa
  37. Questions?
