Cost Effectively and Reliably Aggregating Billions of Messages Per Day Using Kafka (Chunky Gupta and Osman Sarood, Mist Systems) Kafka Summit NYC 2019

In this session, we will discuss Live Aggregators (LA), Mist's highly reliable and massively scalable in-house real-time aggregation system that relies on Kafka for ensuring fault tolerance and scalability. LA consumes billions of messages a day from Kafka with a memory footprint of over 750 GB and aggregates over 100 million time series. Since it runs entirely on top of AWS spot instances, it is designed to be highly reliable. LA can recover from hours-long complete EC2 outages using its checkpointing mechanism, which depends on Kafka. This recovery mechanism restores the checkpoint and replays messages from Kafka where it left off, ensuring no data loss. The characteristic that sets LA apart is its ability to autoscale by intelligently learning about resource usage and allocating resources accordingly. LA emits custom metrics that track resource usage for different components, i.e., Kafka consumer, shared memory manager, and aggregator, to achieve server utilization of over 70%. We do multi-level aggregations in LA to intelligently solve load imbalance issues among different partitions of a Kafka topic. We'd demonstrate multi-level aggregation using an example in which we aggregate indoor location data coming from different organizations both spatially and temporally. We'd explain how changing the partitioning key, along with writing intermediate data back to Kafka in a new topic for the next-level aggregators, helps Mist scale our solution. LA runs on top of 400+ cores, comprised of 10+ different Amazon EC2 spot instance types/sizes. We track the CPU usage for reading each Kafka stream on all the different instance types/sizes. We have several months of such data from our production Mesos cluster, which we are incorporating into LA's scheduler to improve our server utilization and prevent CPU hot spots from developing on our cluster. Detailed blog: https://www.mist.com/live-aggregators-highly-reliable-massively-scalable-real-time-aggregation-system/


  1. Cost Effectively and Reliably Aggregating Billions of Messages Per Day Using Apache Kafka®
     Osman Sarood, Infrastructure and Distributed Systems Lead, Mist Systems
     Chunky Gupta, Distributed Systems Engineer, Mist Systems
  2. Mist Architecture
     1 TB+ | 10 Billion+ Msgs | 10's of TB+ | 500+ partitions
     Live Aggregators: Real-time Aggregation System
     80% of DC on Spot, 70% cheaper (vs. reserved)
  3. Acknowledgement: Amarinder Singh Bindra, Ebrahim Safavi, Jitendra Harlalka
  4. Outline
     • How do we aggregate?
     • Live Aggregators architecture
     • Autoscaling
     • Multi-level Aggregations
  5. Real-time Processing/Aggregation
  6. What Live Aggregators is for You?
  7. What Live Aggregators is for You? (contd.): Total Time Series: 2; # Aggregation Operations: 8
  8. Terminologies
     • View: a set of tuples which contain aggregated data for a defined time interval, based on user-defined groupings
     • Grouping Columns: columns to use as aggregation keys
     • Aggregation Info: type of aggregation, what is aggregated, etc.
     • Time Series: series of data points for one set of grouping columns, in time order
     20+ aggregation types: Sum, Count, Percentiles, Median, Average, Distinct Count, Spatial Count, ...
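To make the terminology concrete, here is a minimal sketch of what a view definition could look like; the field names and format are illustrative assumptions, not Mist's actual configuration:

# Hypothetical view definition, illustrating the terminology above.
# Field names are assumptions, not Mist's actual configuration format.
view = {
    "name": "client_traffic_per_org",
    "interval": "10m",                    # time interval each output tuple covers
    "grouping_columns": ["org"],          # aggregation keys
    "aggregations": [                     # aggregation info
        {"op": "distinct_count", "column": "client", "as": "num_clients"},
        {"op": "sum", "column": "bytes_tx", "as": "total_bytes_tx"},
        {"op": "max", "column": "bytes_tx", "as": "max_bytes_tx"},
    ],
}
# Each (interval, org) combination yields one tuple; that tuple's values over
# successive intervals form a time series for the grouping.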
  9. Live Aggregators Architecture: LA Data Store
  10. Live Aggregators Executor
      Processes per executor (running on EC2 Spot Instances): Process 1: Kafka Reader, Process 2: Shared Memory Manager, Process 3: View Runner 1, Process 4: View Runner 2
      Example messages: Msg# 1 (Client: Sam, Bytes_tx: 100, Org: Mist), Msg# 2 (Client: John, Bytes_tx: 60, Org: Mist), Msg# 3 (Client: Ayaana, Bytes_tx: 20, Org: Home)
      View 1 State (Time Interval, Org, num_clients, total_bytes_tx): 00:00-00:10, Mist, 1, 100 after Msg# 1; 00:00-00:10, Mist, 2, 160 after Msg# 2
      View 2 State (Time Interval, Org, max_bytes_tx): 00:00-00:10, Mist, 100
      Checkpoint to / fetch checkpoint from S3
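A minimal sketch of the update-and-checkpoint idea behind the executor, assuming the Kafka Reader, Shared Memory Manager, and View Runners (separate processes in the real system) are collapsed into one class; names and message shapes are illustrative, not Mist's actual code:

import json

class ViewRunner:
    def __init__(self):
        # View 1 state keyed by (time interval, org), as in the slide's example
        self.state = {}

    def process(self, msg):
        key = (msg["interval"], msg["org"])
        row = self.state.setdefault(key, {"clients": set(), "total_bytes_tx": 0})
        row["clients"].add(msg["client"])
        row["total_bytes_tx"] += msg["bytes_tx"]

    def checkpoint(self):
        # In production a snapshot like this is written to S3 together with the
        # Kafka offset, so a replacement executor can fetch it and resume there.
        return json.dumps({
            f"{interval}/{org}": {"num_clients": len(v["clients"]),
                                  "total_bytes_tx": v["total_bytes_tx"]}
            for (interval, org), v in self.state.items()})

runner = ViewRunner()
runner.process({"interval": "00:00-00:10", "org": "Mist", "client": "Sam", "bytes_tx": 100})
runner.process({"interval": "00:00-00:10", "org": "Mist", "client": "John", "bytes_tx": 60})
print(runner.checkpoint())  # num_clients = 2, total_bytes_tx = 160, matching the slide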
  11. Autoscaling: Live Aggregators Scheduler
      LA Scheduler: View Queue (View1, View2, View3), Zookeeper Manager, Task Manager
      Component states move from Waiting -> Picked -> Running (View 1, View 2, View 3)
      Views are assigned to LA tasks per partition, e.g. LA Task 1 runs View 1: Partition 1, View 2: Partition 1, View 3: Partition 2
  12. Live Aggregators Scale • Message consumption rate from Kafka: 25 Billion+ reads per day [charts: ~620k messages per sec, ~480k messages per sec]
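For context, 25 billion reads per day averages out to roughly 290k messages per second (25e9 / 86,400 s), which is consistent with the per-second rates in the hundreds of thousands (~480k-620k) shown on the charts.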
  13. Live Aggregators Scale (contd.) • Number of Time Series: 300 Million+ at peak times • Aggregation Operations: 2 Million+ at peak times
  14. Live Aggregators Scale (contd.) • Memory Footprint: 2.5 TB+ at peak times • Writes to Cassandra: 4 Billion+ writes per day
  15. Reliable? Cost Effective? Scalable?
  16. Reliability 24x7: Spot Fleet, Controlled Chaos (stop and resume), Uncontrolled Chaos
  17. Spot Market Volatility: 800 Spot instances terminated in a single day! (more than our production DC)
  18. Live Aggregators Controller
      Lag = Timestamp of Most Recent Produced Msg - Timestamp of Last Msg LA Processed
      Msg #   Offset   Timestamp     Lag (sec)
      1       10       4:59:00 pm    60
      2       11       4:59:30 pm    30
      3       12       4:59:55 pm    5
      4       13       5:00:00 pm    0
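A minimal sketch of the lag computation above, reproducing the table's numbers; the function and variable names are placeholders, not Mist's actual controller code:

import datetime

def lag_seconds(most_recent_produced_ts, last_processed_ts):
    """Lag = timestamp of most recently produced msg - timestamp of last msg LA processed."""
    return (most_recent_produced_ts - last_processed_ts).total_seconds()

# The newest message on the topic was produced at 5:00:00 pm.
newest = datetime.datetime(2019, 4, 2, 17, 0, 0)
for offset, hms in [(10, (16, 59, 0)), (11, (16, 59, 30)), (12, (16, 59, 55)), (13, (17, 0, 0))]:
    last_processed = datetime.datetime(2019, 4, 2, *hms)
    print(offset, lag_seconds(newest, last_processed))  # 60.0, 30.0, 5.0, 0.0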
  19. Fast Recovery After Failure
  20. Dynamic Load (Trend vs Seasonality) [chart: daily seasonality plus long-term trend]
  21. Right Sizing: Best Fit
  22. Live Aggregators Executor
  23. Autoscaling: Live Aggregators Executor
      Component                  Cores
      Kafka Reader               0.2
      Shared memory (per view)   0.1
      View 1                     0.8
      View 2                     0.6
      View 3                     0.9
      [Diagram: per-process usage 0.8 + 0.6 + 0.2 + 0.2 = 1.8 cores for the task]
  24. Autoscaling: Live Aggregators Scheduler
      LA Scheduler: View Queue (View1 0.8 cores, View2 0.6 cores, View3 0.9 cores), Zookeeper Manager, Task Manager
      Per-component costs: Kafka Reader 0.2, shared memory (per view) 0.1, View 1 0.8, View 2 0.6, View 3 0.9
      Offer: 2 cores. LA Task 1 reserves Kafka Reader 0.2 + View1 0.8 + View2 0.6 + shared memory 0.1 per view (0.2 total) = 1.8 cores reserved
      Cores available as components are placed: 2.0 -> 1.8 -> 0.9 -> 0.2 (View3, needing 0.9 + 0.1, no longer fits in the remaining 0.2)
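A rough sketch of this kind of fit check against a resource offer, using the per-component costs from the slide; the first-fit-in-queue-order policy shown here is an assumption, not necessarily the scheduler's actual packing strategy:

KAFKA_READER_CORES = 0.2
SHARED_MEMORY_CORES_PER_VIEW = 0.1

def pack_views(offer_cores, view_queue):
    reserved = KAFKA_READER_CORES            # every LA task needs a Kafka Reader
    placed = []
    for name, view_cores in view_queue:
        needed = view_cores + SHARED_MEMORY_CORES_PER_VIEW
        if reserved + needed <= offer_cores:
            reserved += needed
            placed.append(name)
    return placed, round(reserved, 2)

print(pack_views(2.0, [("View1", 0.8), ("View2", 0.6), ("View3", 0.9)]))
# (['View1', 'View2'], 1.8) -- View3 (0.9 + 0.1 shared memory) does not fit in the remaining 0.2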
  25. Lying Factor
      Lying Factor = #Cores reserved - #Cores used
      [Chart: Lying Factor over time]
      Component                  Evening Load (Cores)   High Load (Cores)
      Kafka Reader               0.2                    0.3
      Shared memory (per view)   0.1 * 2                0.15 * 2
      View 1                     0.8                    0.9
      View 2                     0.6                    0.7
      Total Cores for LA Task    1.8                    2.2
      Reserved Cores             1.8                    1.8
      Lying Factor               0                      -0.4
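A minimal sketch of the lying-factor bookkeeping, reproducing the numbers in the table above (illustrative code, not the actual metric pipeline):

def lying_factor(reserved_cores, used_cores):
    # Lying factor = cores reserved - cores actually used
    return round(reserved_cores - used_cores, 2)

reserved = 1.8
# Evening load: Kafka Reader 0.2 + shared memory 0.1*2 + View 1 0.8 + View 2 0.6 = 1.8 cores
evening_use = 1.8
# High load:    Kafka Reader 0.3 + shared memory 0.15*2 + View 1 0.9 + View 2 0.7 = 2.2 cores
high_use = 2.2
print(lying_factor(reserved, evening_use))  # 0.0  -> reservation matches usage
print(lying_factor(reserved, high_use))     # -0.4 -> the task uses more than it reserved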
  26. Autoscaler: No Scaling • Lower Threshold = -0.05 Cores • Upper Threshold = 0.20 Cores
  27. Autoscaler: Scale Up (Noisy Neighbor!!) • Lower Threshold = -0.05 Cores • Upper Threshold = 0.20 Cores
  28. Autoscaler: Scale Down • Lower Threshold = -0.05 Cores • Upper Threshold = 0.20 Cores
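A sketch of the scaling decision these three slides describe, assuming the lying factor is compared against the two thresholds in the obvious way (scale up when it drops below the lower threshold, scale down when it exceeds the upper one); the exact policy is an assumption:

# Illustrative autoscaling decision based on the lying factor and the thresholds above.
LOWER_THRESHOLD = -0.05   # cores; below this the task is under-provisioned
UPPER_THRESHOLD = 0.20    # cores; above this the task is over-provisioned

def autoscale_decision(lying_factor):
    if lying_factor < LOWER_THRESHOLD:
        return "scale up"     # using noticeably more than reserved (e.g. noisy-neighbor pressure)
    if lying_factor > UPPER_THRESHOLD:
        return "scale down"   # reserving noticeably more than used
    return "no scaling"

for lf in (0.0, -0.4, 0.5):
    print(lf, autoscale_decision(lf))  # 0.0 no scaling, -0.4 scale up, 0.5 scale down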
  29. Autoscaling Effectiveness • Resources Used vs Reserved (Seasonality) [chart scale: ~1000 cores]
  30. Multi Level Aggregation (Heatmap Example)
      • Each device reports its location to Kafka every second (Mist office example)
      • Goal: Client Density Heatmap
      • Input is sharded by Client ID across multiple partitions
  31. Multi Level Aggregation
      LA Task 1 (Topic 1, partition 0), LA Task 2 (Topic 1, partition 1), LA Task 3 (Topic 1, partition 2): Consume Topic 1, Produce Topic 2
      LA Task 4 (Topic 2, partition 2): Consume Topic 2
      [Diagram: per-partition 4x4 client-count grids that the second-level task combines into one heatmap]
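A rough sketch of the two-level flow shown above: first-level tasks aggregate one partition of Topic 1 (sharded by client ID) and write partial, re-keyed counts to a second topic, and a second-level task merges them per map. Topic names, message shapes, and the in-memory lists standing in for Kafka producers/consumers are all illustrative assumptions:

from collections import defaultdict

def level1_aggregate(partition_msgs):
    # One first-level LA task: count clients per heatmap cell for its partition.
    partial = defaultdict(int)
    for msg in partition_msgs:
        partial[(msg["map_id"], msg["cell"])] += 1
    # In production these partials are produced to a new Kafka topic ("Topic 2"),
    # keyed by map_id so one second-level task sees every partial for a map.
    return [{"map_id": m, "cell": c, "count": n} for (m, c), n in partial.items()]

def level2_merge(intermediate_msgs):
    # One second-level LA task: merge partial counts into the final heatmap.
    heatmap = defaultdict(int)
    for msg in intermediate_msgs:
        heatmap[(msg["map_id"], msg["cell"])] += msg["count"]
    return dict(heatmap)

p0 = [{"map_id": "office", "cell": (1, 2)}, {"map_id": "office", "cell": (1, 2)}]
p1 = [{"map_id": "office", "cell": (1, 2)}, {"map_id": "office", "cell": (3, 0)}]
print(level2_merge(level1_aggregate(p0) + level1_aggregate(p1)))
# {('office', (1, 2)): 3, ('office', (3, 0)): 1}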
  32. Multi Level Aggregation: Client Density for a School. We will be adding the architecture diagram for this to explain.
  33. Future Work
      1. Joining multiple streams
      2. Instance-specific resource allocation
      3. Improving shared memory usage using Go
      4. Dynamic rescheduling of views to improve Kafka load
  34. Rate today's session. Thank You!
