In this session, we will discuss Live Aggregators (LA), Mist’s highly reliable and massively scalable in-house real-time aggregation system, which relies on Kafka for fault tolerance and scalability. LA consumes billions of messages a day from Kafka with a memory footprint of over 750 GB and aggregates over 100 million time series. Because it runs entirely on AWS spot instances, which can be reclaimed at any time, it is designed to be highly reliable. LA can recover from hours-long complete EC2 outages using a checkpointing mechanism that depends on Kafka: it restores the checkpoint and replays messages from Kafka starting where it left off, ensuring no data loss.

The characteristic that sets LA apart is its ability to autoscale by learning about resource usage and allocating resources accordingly. LA emits custom metrics that track resource usage for its different components, namely the Kafka consumer, the shared memory manager and the aggregator, to achieve server utilization of over 70%. We do multi-level aggregations in LA to solve load imbalance issues among the partitions of a Kafka topic. We will demonstrate multi-level aggregation using an example in which we aggregate indoor location data coming from different organizations both spatially and temporally. We will explain how changing the partitioning key, along with writing intermediate data back to Kafka in a new topic for the next-level aggregators, helps Mist scale its solution.

LA runs on top of 400+ cores, comprising 10+ different Amazon EC2 spot instance types/sizes. We track the CPU usage for reading each Kafka stream on all the different instance types/sizes. We have several months of such data from our production Mesos cluster, which we are incorporating into LA’s scheduler to improve our server utilization and keep CPU hot spots from developing on our cluster. Detailed blog: https://www.mist.com/live-aggregators-highly-reliable-massively-scalable-real-time-aggregation-system/
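The checkpoint-and-replay recovery described in the abstract can be illustrated with a minimal sketch. It is not LA's implementation: the Kafka partition is stood in for by an in-memory list, and the `Aggregator` and `Checkpoint` names, the sum aggregation, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """Hypothetical checkpoint: aggregated state plus the next offset to read."""
    next_offset: int
    sums: dict


class Aggregator:
    """Toy aggregator that sums values per key from an append-only log."""

    def __init__(self):
        self.sums = defaultdict(float)
        self.next_offset = 0

    def process(self, log, upto):
        # Consume log[next_offset:upto]; the list stands in for a Kafka partition.
        for offset in range(self.next_offset, upto):
            key, value = log[offset]
            self.sums[key] += value
        self.next_offset = max(self.next_offset, upto)

    def checkpoint(self):
        # Snapshot state and offset together so they stay consistent.
        return Checkpoint(self.next_offset, dict(self.sums))

    @classmethod
    def restore(cls, ckpt):
        agg = cls()
        agg.sums.update(ckpt.sums)
        agg.next_offset = ckpt.next_offset
        return agg


log = [("ap1", 1.0), ("ap2", 2.0), ("ap1", 3.0), ("ap2", 4.0), ("ap1", 5.0)]

agg = Aggregator()
agg.process(log, upto=3)
ckpt = agg.checkpoint()      # state covers offsets 0..2
agg.process(log, upto=4)     # progress after the checkpoint, lost in the crash
del agg                      # simulated instance termination

recovered = Aggregator.restore(ckpt)
recovered.process(log, upto=len(log))  # replay from offset 3 onward
result = dict(recovered.sums)
print(result)
```

Because the log retains every message and the checkpoint stores the offset alongside the state, the message processed after the checkpoint is simply processed again during replay, so the recovered totals match an uninterrupted run.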
Cost Effectively and Reliably Aggregating Billions of Messages Per Day Using Kafka (Chunky Gupta and Osman Sarood, Mist Systems) Kafka Summit NYC 2019
1. Osman Sarood
Infrastructure and Distributed Systems Lead, Mist Systems
Chunky Gupta
Distributed Systems Engineer, Mist Systems
Cost Effectively and Reliably Aggregating Billions of Messages Per Day Using Apache Kafka®
2. Mist Architecture
1 TB+
10 Billion+ Msgs
10s of TB+
500+ partitions
Mist Architecture
Live Aggregators: Real-time Aggregation System
80% DC on Spot
70% cheaper (reserved)
7. What Live Aggregators Is for You? (contd.)
Total Time Series: 2
Number of Aggregation Operations: 8
8. Terminologies
• View: A set of tuples containing aggregated data for a defined time interval, based on user-defined groupings
• Grouping Columns: Columns to use as aggregation keys
• Aggregation Info: Type of aggregation, the field being aggregated, etc.
• Time Series: A series of data points for one set of grouping-column values, in time order
20+ Aggregation Types
• Sum
• Count
• Percentiles
• Median
• Average
• Distinct Count
• SpatialCount
• ??
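To make the terminology concrete, here is a minimal sketch of building a view from raw messages. The `build_view` function, the message fields (`ts`, `org`, `ap`, `bytes`) and the 60-second interval are hypothetical and only illustrate how grouping columns, aggregation info and time series relate; they are not LA's actual API.

```python
from collections import defaultdict
from statistics import median


def build_view(messages, grouping_cols, value_col, interval_sec, aggregations):
    """Group messages by grouping columns and time interval, then aggregate."""
    buckets = defaultdict(list)
    for msg in messages:
        # Align each timestamp to the start of its interval.
        window_start = msg["ts"] - msg["ts"] % interval_sec
        key = tuple(msg[col] for col in grouping_cols) + (window_start,)
        buckets[key].append(msg[value_col])
    # Each key is one point of one time series; apply every aggregation to it.
    return {
        key: {name: fn(values) for name, fn in aggregations.items()}
        for key, values in sorted(buckets.items())
    }


# Aggregation info: which operations to run on the chosen value column.
AGGS = {
    "sum": sum,
    "count": len,
    "median": median,
    "average": lambda values: sum(values) / len(values),
}

messages = [
    {"ts": 0, "org": "a", "ap": "ap1", "bytes": 10},
    {"ts": 30, "org": "a", "ap": "ap1", "bytes": 30},
    {"ts": 65, "org": "a", "ap": "ap1", "bytes": 5},
    {"ts": 10, "org": "b", "ap": "ap9", "bytes": 7},
]

view = build_view(messages, ["org", "ap"], "bytes", 60, AGGS)
for key, row in view.items():
    print(key, row)
```

In this sample there are two time series, one per distinct (`org`, `ap`) pair, and the series for (`a`, `ap1`) has two data points because its messages span two 60-second intervals.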
18. Live Aggregators Controller
Lag = (Timestamp of Most Recent Produced Msg) - (Timestamp of Last Msg LA Processed)

Msg # | Offset | Timestamp  | Lag (sec)
1     | 10     | 4:59:00 pm | 60
2     | 11     | 4:59:30 pm | 30
3     | 12     | 4:59:55 pm | 5
4     | 13     | 5:00:00 pm | 0
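The lag definition can be checked against the table with a short sketch. The `lag_seconds` helper and the calendar date are assumptions for illustration; only the offsets and times of day come from the slide.

```python
from datetime import datetime


def lag_seconds(latest_produced_ts, last_processed_ts):
    """Lag = timestamp of most recent produced msg - timestamp of last msg processed."""
    return int((latest_produced_ts - last_processed_ts).total_seconds())


# (offset, timestamp) pairs from the slide; the date itself is arbitrary.
stream = [
    (10, datetime(2019, 1, 1, 16, 59, 0)),
    (11, datetime(2019, 1, 1, 16, 59, 30)),
    (12, datetime(2019, 1, 1, 16, 59, 55)),
    (13, datetime(2019, 1, 1, 17, 0, 0)),
]

# The most recently produced message is the newest one in the stream.
latest_produced = max(ts for _, ts in stream)

# Lag as it would be measured just after processing each message.
lags = [lag_seconds(latest_produced, ts) for _, ts in stream]
print(lags)
```

Measuring lag in time rather than in offsets keeps the number comparable across streams with very different message rates.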
30. Multi Level Aggregation (Heatmap Example)
Device
Mist Office
• Each device sends its location to Kafka every second
• Client Density Heatmap
• Sharded by Client ID across multiple partitions
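The heatmap example can be sketched as a two-level pipeline: level 1 reads location messages sharded by client ID and emits partial counts, which are written to an intermediate topic re-keyed by site for level 2. This is a simulation under stated assumptions, not LA's code: the topics are plain dictionaries, and the `site`, `cell` and `client` field names, the CRC32 partitioner and the 60-second window are all hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    # Stable hash partitioner (assumption; Kafka's default partitioner differs).
    return zlib.crc32(key.encode()) % num_partitions


def level1(partition_msgs, window_sec):
    """Per raw partition: count distinct clients per (site, cell, window)."""
    seen = defaultdict(set)
    for m in partition_msgs:
        window = m["ts"] - m["ts"] % window_sec
        seen[(m["site"], m["cell"], window)].add(m["client"])
    return [
        {"site": site, "cell": cell, "window": window, "clients": len(ids)}
        for (site, cell, window), ids in seen.items()
    ]


def level2(partials):
    """Per intermediate partition: sum the partial counts from level 1."""
    density = defaultdict(int)
    for p in partials:
        density[(p["site"], p["cell"], p["window"])] += p["clients"]
    return dict(density)


def run(messages, num_partitions=4, window_sec=60):
    # Raw topic: sharded by client ID, which spreads load evenly.
    raw_topic = defaultdict(list)
    for m in messages:
        raw_topic[partition_for(m["client"], num_partitions)].append(m)

    # Intermediate topic: level-1 output re-keyed by site, so every
    # partial for a given site reaches the same level-2 aggregator.
    intermediate_topic = defaultdict(list)
    for msgs in raw_topic.values():
        for partial in level1(msgs, window_sec):
            part = partition_for(partial["site"], num_partitions)
            intermediate_topic[part].append(partial)

    heatmap = {}
    for partials in intermediate_topic.values():
        heatmap.update(level2(partials))
    return heatmap


messages = [
    {"ts": 1, "client": "c1", "site": "office", "cell": (0, 0)},
    {"ts": 2, "client": "c1", "site": "office", "cell": (0, 0)},
    {"ts": 1, "client": "c2", "site": "office", "cell": (0, 0)},
    {"ts": 3, "client": "c3", "site": "office", "cell": (1, 2)},
]

heatmap = run(messages)
print(heatmap)
```

Summing the partial counts is valid here because each client's messages land in exactly one raw partition, so no client is counted twice for the same cell and window. Level 2 only sees compact partial counts, so keying by site at that stage is far less likely to overload a single partition than keying the raw per-second stream by site would be.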
32. Multi Level Aggregation: Client Density for a School
We will be adding the architecture diagram for this to explain
33. Future Work
1. Joining multiple streams
2. Instance-specific resource allocation
3. Improving shared memory usage using Go
4. Dynamic rescheduling of views to improve Kafka load