Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable self-serve, real-time, multitenant monitoring service at Yahoo


Building a real-time monitoring service that handles millions of custom events per second while satisfying complex rules, varied throughput requirements, and numerous dimensions simultaneously is a complex endeavor. Sumeet Singh and Mridul Jain explain how Yahoo approached these challenges with Apache Storm Trident, Kafka, HBase, and OpenTSDB and discuss the lessons learned along the way.

Sumeet and Mridul explain scaling patterns backed by real scenarios and data to help attendees develop their own architectures and strategies for dealing with the scale challenges that come with real-time big data systems. They also explore the tradeoffs made in catering to a diverse set of daily users and the associated usability challenges that motivated Yahoo to build a self-serve, easy-to-use platform that requires minimal programming experience. Sumeet and Mridul then discuss event-level tracking for debugging and troubleshooting problems that our users may encounter at this scale. Over the course of their talk, they also address building infrastructure and operational intelligence with anomaly detection, alert correlation, and trend analysis based on the monitoring platform.

Published in: Technology


  1. Lessons Learned Building A Scalable Self-serve, Real-time, Multi-tenant Monitoring Service. Presented by Mridul Jain and Sumeet Singh, March 31, 2016, Strata Conference + Hadoop World 2016, San Jose
  2. Introduction
     Mridul Jain, Senior Principal Architect, Big Data and Machine Learning, Science and Technology, 701 First Avenue, Sunnyvale, CA 94089 USA, @mridul_jain
     - Big ML at Yahoo
     - Has used Storm and Kafka for real-time trend analysis in search and central monitoring
     - Co-authored Pig on Storm
     - Co-authored CaffeOnSpark for distributed deep learning
     Sumeet Singh, Sr. Director, Product Management, Cloud and Big Data Platforms, Science and Technology, 701 First Avenue, Sunnyvale, CA 94089 USA, @sumeetksingh
     - Manages Hadoop products team at Yahoo
     - Responsible for Product Management, Strategy and Customer Engagements
     - Managed Cloud Services products team and headed strategy functions for the Cloud Platform Group at Yahoo
     - MBA from UCLA and MS from Rensselaer Polytechnic Institute (RPI)
  3. Acknowledgement: We want to acknowledge the contributions from Kapil Gupta and Arun Gupta, Principal Architects with the Yahoo Monitoring team, to this presentation as well as the monitoring platform. We would also like to thank the entire Yahoo Monitoring and Hadoop and Big Data Platforms teams for making the next-generation monitoring services a reality at Yahoo.
  4. Agenda
     1. Overview
     2. Transitioning from Classical to Real-time Big Data Architecture
     3. Lessons Learned Scaling the Real-time Big Data Stack
     4. Lessons Learned Optimizing for System Performance
     5. Q&A
  5. Introduction to Yahoo’s Monitoring as a Service
     - Infra Monitoring (Hosts): CPU, disk, network; host uptime; HTTP session errors
     - App Monitoring (Apps): requests per second; avg. latency; API access errors
     - Hosted Multi-tenant Monitoring Service: collection, storage, scheduling, coordination, alerts / thresholds, dashboards, aggregation
  6. Classical Architecture – Pre Real-time Big Data Tech: 200,000 hosts, 60 aggregators, 2,400 DB shards, 43 collectors, frontend / query
  7. Classical Architecture – Pre Real-time Big Data Tech: same layout. (1) Large fan-out
  8. Classical Architecture – Pre Real-time Big Data Tech: (1) Large fan-out, (2) Manually sharded DBs
  9. Classical Architecture – Pre Real-time Big Data Tech: (1) Large fan-out, (2) Manually sharded DBs, (3) Massive query federation
  10. Classical Architecture – Pre Real-time Big Data Tech: (1) Large fan-out, (2) Manually sharded DBs, (3) Massive query federation. ✗ Manageability challenges
  11. Classical Architecture – Pre Real-time Big Data Tech: hosts H1-H5 -> Collector -> Aggregator Server -> DB Server -> Dashboard
  12. Classical Architecture – Pre Real-time Big Data Tech: hosts manually split into groups A and B. (1) Manual partitioning of hosts
  13. Classical Architecture – Pre Real-time Big Data Tech: (1) Manual partitioning of hosts; (2) Single-threaded aggregator per cluster, sequential processing of rules, 4M DP/min per aggregator
  14. Classical Architecture – Pre Real-time Big Data Tech: (1) Manual partitioning of hosts; (2) Single-threaded aggregator per cluster, sequential processing of rules, 4M DP/min per aggregator; (3) 1 shard per cluster, 1.5M DP/min
  15. Classical Architecture – Pre Real-time Big Data Tech: (1) Manual partitioning of hosts; (2) Single-threaded aggregator per cluster, sequential processing of rules, 4M DP/min per aggregator; (3) 1 shard per cluster, 1.5M DP/min; (4) Sequential fetch for federated queries
  16. Classical Architecture – Pre Real-time Big Data Tech: (1) Manual partitioning of hosts; (2) Single-threaded aggregator per cluster, sequential processing of rules, 4M DP/min per aggregator; (3) 1 shard per cluster, 1.5M DP/min; (4) Sequential fetch for federated queries. ✗ Scale challenges ✗ Availability challenges
  17. Architecture Based on Real-time Big Data Tech: Hosts -> Collectors -> Data Highway -> UI Dashboard & Graphs
  18. Architecture Based on Real-time Big Data Tech (same pipeline, built on standard big data frameworks)
     - No manual partitioning / sharding
     - Built-in horizontal scalability
     - Built-in high availability
     ✔ Manageability ✔ Scalability ✔ Availability
  19. Scale and Performance: Data Highway -> Data Ingest Topology -> per-tenant Topics (Tenant 1, 2, 3) -> per-tenant Aggregation Topologies -> UI Dashboard & Graphs
  20. Scale and Performance (same pipeline)
     - Low-latency real-time processing
     - 5x the scale of the previous architecture
     - Massive parallelism and pipelining
     - Real-time aggregation, thresholds and alerts
     - Support for larger historic data lookup & processing
     - Support for self-serve complex processing, data slicing and dicing
     - Pluggable algorithms and ML models (e.g., EGADS)
  21. Scale and Performance (same pipeline), plus self-serve rule deployment
     - Rules such as "A = Filter from * where host regex …" live in Git (/alert_policy/kpis.yaml, /contacts/oc.yaml, /rules/system.yo)
     - Run semantic and syntactic validation via the CLI; Git commit, PR, merge; then CI / CD
     - Alerts go to OC, correlators and mailing lists
  22. Scale and Performance (same as slide 21). ✔ Self-serve easy deploys ✔ Real-time alerting
  23. Self Serve Rules
     A = filter * where namespace == "product1" and application == "apache",60,3
     B = filter * where namespace == "product2" and Tag.host in ("hostgrp1","hostgrp4")
     C = threshold A Metric.monstatus.latency < 2 as "mycheck"
     Store C
     alert C, $LatencyAlertConfig, $NotificationID, LOW, $UrlID, $CustMessageID
     - Simple and rich processing language with custom UDF support for algorithms and statistical functions
     - Support for arithmetic, set, and stats operators, group-by, joins, etc.
     - Events from different namespaces can be combined
     - Thresholds and policies, notification contacts, and severity in a simple, hot-deployable fashion
     - Store relations and calculations as you like
     - Automatically track all the good, bad, and missing events
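     As a rough, hypothetical illustration only (not the actual rule engine), a filter-plus-threshold rule like A and C could compile down to something resembling the Storm bolt below; the class, tuple field names, and packages are assumptions:

      import java.util.Map;
      import org.apache.storm.task.OutputCollector;
      import org.apache.storm.task.TopologyContext;
      import org.apache.storm.topology.OutputFieldsDeclarer;
      import org.apache.storm.topology.base.BaseRichBolt;
      import org.apache.storm.tuple.Fields;
      import org.apache.storm.tuple.Tuple;
      import org.apache.storm.tuple.Values;

      // Hypothetical sketch: filter events by namespace/application and emit a
      // check result when latency crosses a threshold, roughly what rules A and C express.
      public class LatencyThresholdBolt extends BaseRichBolt {
          private OutputCollector collector;

          @Override
          public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
              this.collector = collector;
          }

          @Override
          public void execute(Tuple event) {
              String namespace = event.getStringByField("namespace");
              String application = event.getStringByField("application");
              double latency = event.getDoubleByField("latency");

              // Rule A: filter * where namespace == "product1" and application == "apache"
              if ("product1".equals(namespace) && "apache".equals(application)) {
                  // Rule C: threshold A Metric.monstatus.latency < 2 as "mycheck"
                  boolean healthy = latency < 2.0;
                  collector.emit(new Values("mycheck", healthy, latency));
              }
              collector.ack(event);
          }

          @Override
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("check", "healthy", "latency"));
          }
      }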
  24. Lessons Learned
     1. Producer-consumer problem at scale requires the right balance in architecture
     2. Skewness in data is hard to debug
     3. E2E multi-tenancy and resourcing should be handled strategically
     4. Optimizations made in async systems are hard to debug
     5. Do not neglect the assumptions/optimizations outside your application
  25. Lessons Learned (same list as above)
  26. Storm + Kafka Based Architecture: Central Collector (no spooling) -> HTTP POST -> Storm (spout with Jetty servlet -> bolt) -> Kafka topics per product (Product 1 … Product N, 133 topics)
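     A minimal sketch of the "spout with Jetty servlet" idea, assuming current org.apache.storm and Jetty 9 APIs rather than whatever exact versions the deck used: the spout embeds an HTTP endpoint, queues POST bodies, and emits them as tuples. The port, queue size, and class names are illustrative, not Yahoo's code.

      import java.io.IOException;
      import java.util.Map;
      import java.util.concurrent.LinkedBlockingQueue;
      import javax.servlet.ServletException;
      import javax.servlet.http.HttpServletRequest;
      import javax.servlet.http.HttpServletResponse;
      import org.apache.storm.spout.SpoutOutputCollector;
      import org.apache.storm.task.TopologyContext;
      import org.apache.storm.topology.OutputFieldsDeclarer;
      import org.apache.storm.topology.base.BaseRichSpout;
      import org.apache.storm.tuple.Fields;
      import org.apache.storm.tuple.Values;
      import org.eclipse.jetty.server.Request;
      import org.eclipse.jetty.server.Server;
      import org.eclipse.jetty.server.handler.AbstractHandler;

      // Illustrative HTTP-ingesting spout: a Jetty handler enqueues POST bodies
      // and nextTuple() drains the queue into the topology.
      public class HttpIngestSpout extends BaseRichSpout {
          private transient SpoutOutputCollector collector;
          private transient LinkedBlockingQueue<String> events;
          private transient Server server;

          @Override
          public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
              this.collector = collector;
              this.events = new LinkedBlockingQueue<>(100_000);  // bounded handoff queue
              this.server = new Server(8080);                    // port is an assumption
              server.setHandler(new AbstractHandler() {
                  @Override
                  public void handle(String target, Request baseRequest, HttpServletRequest req,
                                     HttpServletResponse resp) throws IOException, ServletException {
                      // Concatenate the body lines; good enough for a sketch.
                      String body = req.getReader().lines().reduce("", String::concat);
                      events.offer(body);                        // drop if the queue is full
                      resp.setStatus(HttpServletResponse.SC_OK);
                      baseRequest.setHandled(true);
                  }
              });
              try {
                  server.start();
              } catch (Exception e) {
                  throw new RuntimeException("failed to start embedded Jetty", e);
              }
          }

          @Override
          public void nextTuple() {
              String event = events.poll();
              if (event != null) {
                  collector.emit(new Values(event));
              }
          }

          @Override
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("event"));
          }
      }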
  27. Scale of an Online Monitoring Solution (same architecture, writing to TSDB_1, TSDB_2, TSDB_3)
     - 400 bolt tasks in 40 workers
     - 450 topologies
     - 15 topics / topology
     - 3 partitions / topic
     - 3 TSDB topics
     - 222 partitions per topic
  28. A Producer - Consumer Pipeline: Data Highway -> Data Ingest Topology -> per-tenant Topics -> per-tenant Aggregation Topologies -> UI Dashboard & Graphs
  29. A Producer - Consumer Pipeline (same pipeline)
     - Excellent E2E synchronization
     - Provides a breather against individual component failures
     - Reasonably good performance in spite of transient failures
     - Can help individual components to scale, if used smartly
  30. Monitoring Time Roll-ups: Kafka cluster (in-memory state per topic) -> Storm spout -> roll-up bolt
     - Huge in-memory state: 220 million/min * 60
     - Trident issues
     - High network -> high CPU
  31. Monitoring Time Roll-ups: Kafka cluster (in-memory state per topic) -> Storm spout
     - Aggregate in the spout: 220 million/min * 60
     - Fields grouping in Kafka for a time series (done by the producer)
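     One way to get the "fields grouping in Kafka for a time series" effect is to key every record by its series identity, so the default partitioner sends all points of a series to the same partition and one spout/consumer task can roll them up. A minimal sketch using the newer Kafka Java producer; the topic, broker list, and key format are assumptions:

      import java.util.Properties;
      import org.apache.kafka.clients.producer.KafkaProducer;
      import org.apache.kafka.clients.producer.ProducerRecord;

      public class SeriesKeyedProducer {
          public static void main(String[] args) {
              Properties props = new Properties();
              props.put("bootstrap.servers", "broker1:9092,broker2:9092");  // assumed brokers
              props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
              props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

              KafkaProducer<String, String> producer = new KafkaProducer<>(props);

              // The key identifies one time series (namespace + metric + host); the default
              // partitioner hashes the key, so all points of a series land on one partition.
              String seriesKey = "product1.apache.latency.host42";
              String point = "1459468800000=1.7";                           // timestamp=value, illustrative
              producer.send(new ProducerRecord<>("tenant1-metrics", seriesKey, point));

              producer.close();
          }
      }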
  32. Kafka Refresh: brokers 1-3 may hold different topics (topic 1-6), but each broker has metadata about every other broker in the cluster
  33. Kafka Refresh: a producer contacts any broker to get the topic list across the cluster every 10 mins
  34. Kafka Refresh: each topic fetch call has a timeout of 10 secs and is a blocking call on the main producer thread
  35. Kafka Refresh: if there are 100 topics and a broker is down (socket timeout), this blocks for 1,000s, longer than the next refresh cycle (10 mins)
  36. Kafka Refresh: this effectively hangs the producer
  37. Kafka Refresh: fix is to disable the periodic refresh; if a broker is down, the producer APIs fetch the metadata from an alternate broker anyway
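     With the 0.8-era producer this behavior is controlled by producer properties; a hedged sketch of the relevant settings (broker list and timeout values are assumptions). A negative topic.metadata.refresh.interval.ms refreshes metadata only on failure, which is the "disable refresh" fix above:

      import java.util.Properties;
      import kafka.javaapi.producer.Producer;
      import kafka.producer.ProducerConfig;

      public class RefreshTunedProducer {
          public static Producer<String, String> create() {
              Properties props = new Properties();
              props.put("metadata.broker.list", "broker1:9092,broker2:9092"); // assumed brokers
              props.put("serializer.class", "kafka.serializer.StringEncoder");

              // A negative refresh interval means metadata is refreshed only on failure,
              // avoiding the periodic, blocking topic-list fetch described above.
              props.put("topic.metadata.refresh.interval.ms", "-1");

              // Keep the per-request timeout short so a dead broker does not stall
              // the producer thread for long when a refresh is forced.
              props.put("request.timeout.ms", "3000");

              return new Producer<>(new ProducerConfig(props));
          }
      }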
  38. A Producer - Consumer Pipeline (same pipeline as slide 28)
     - Excellent E2E synchronization
     - Provides a breather against individual component failures
     - Reasonably good performance in spite of transient failures
     - Can help individual components to scale, if used smartly
     - The queuing system is your last line of defense; choose wisely
  39. Lessons Learned (same list as slide 24)
  40. Skewed Ingestion per Task: spout -> bolts A1-A3 -> bolts B1, B2, at 22 M/min. A high rate of ingestion with a "Group By" on limited dimensions will direct all events for a specific dimension to one task
  41. Skewed Ingestion per Task: spout -> (shuffle) -> combiners com1-com3 -> (partition by) -> bolts B1, B2, at 22 M/min
     - Each combiner maintains local state for each of the dimensions and forwards the aggregated count to B1 or B2
     - Overall state per task shrinks because the combiners share the original big state and aggregate it before forwarding to the final bolts
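     A minimal Storm wiring sketch of this two-stage pattern, assuming hypothetical MetricSpout and FinalRollupBolt classes and illustrative parallelism: shuffle the raw stream across combiners that pre-aggregate locally, then fields-group the partial sums so each dimension still ends up on one final task, but with far less traffic and state.

      import java.util.HashMap;
      import java.util.Map;
      import org.apache.storm.Config;
      import org.apache.storm.StormSubmitter;
      import org.apache.storm.topology.BasicOutputCollector;
      import org.apache.storm.topology.OutputFieldsDeclarer;
      import org.apache.storm.topology.TopologyBuilder;
      import org.apache.storm.topology.base.BaseBasicBolt;
      import org.apache.storm.tuple.Fields;
      import org.apache.storm.tuple.Tuple;
      import org.apache.storm.tuple.Values;

      public class TwoStageAggregationTopology {

          // Stage-1 combiner: keeps a small local count per dimension and flushes
          // partial sums downstream every FLUSH_EVERY tuples, shrinking per-task state.
          public static class LocalCombinerBolt extends BaseBasicBolt {
              private static final int FLUSH_EVERY = 10_000;
              private final Map<String, Long> partial = new HashMap<>();
              private int seen = 0;

              @Override
              public void execute(Tuple input, BasicOutputCollector collector) {
                  String dim = input.getStringByField("dimension");
                  partial.merge(dim, input.getLongByField("count"), Long::sum);
                  if (++seen >= FLUSH_EVERY) {
                      for (Map.Entry<String, Long> e : partial.entrySet()) {
                          collector.emit(new Values(e.getKey(), e.getValue()));
                      }
                      partial.clear();
                      seen = 0;
                  }
              }

              @Override
              public void declareOutputFields(OutputFieldsDeclarer declarer) {
                  declarer.declare(new Fields("dimension", "count"));
              }
          }

          public static void main(String[] args) throws Exception {
              TopologyBuilder builder = new TopologyBuilder();
              // MetricSpout and FinalRollupBolt are hypothetical placeholders.
              builder.setSpout("events", new MetricSpout(), 4);
              // Shuffle grouping spreads hot dimensions evenly across combiners.
              builder.setBolt("combiner", new LocalCombinerBolt(), 16).shuffleGrouping("events");
              // Fields grouping routes all partial sums for one dimension to one final task.
              builder.setBolt("final-rollup", new FinalRollupBolt(), 4)
                     .fieldsGrouping("combiner", new Fields("dimension"));
              Config conf = new Config();
              conf.setNumWorkers(4);
              StormSubmitter.submitTopology("two-stage-agg", conf, builder.createTopology());
          }
      }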
  42. Abuse
     - Max ingestion per TSDB: 120k/s
     - UID table hit hard due to high-cardinality data
     - Lots of in-memory state created in Storm bolts
  43. Lessons Learned (same list as slide 24)
  44. ZooKeeper Scaling: single cluster for aggregation; separate ZK state for Storm (ZK - Storm) and for Kafka (ZK - Kafka)
     - Kafka consumers swapping in and out create heavy churn in the ZK state for Kafka brokers
     - Every time a consumer enters or leaves, all consumers query the group state from ZK
     - Same for Kafka rolling upgrades, restarts, or any bad behaviour by consumers
  45. Topology Scaling: Data Highway -> Data Ingest Topology -> per-tenant Topics -> per-tenant Aggregation Topologies in a single cluster for aggregation -> UI Dashboard & Graphs
  46. Trident Scaling
     A = filter * where namespace == "ABC" and application == "XYZ",5,3
     - 1 rule = 1 logical bolt; Trident accepts < 400 rules per topology, i.e., 400 logical Trident UDFs
     - Limited by the ZooKeeper jute buffer size; tunable, but raising it leads to performance issues: Nimbus OOM, worker heartbeat slowness, etc.
     - E.g., 1,200 rules will need about 3 Trident topologies
  47. Efficient Resourcing and Hardware Utilization: Data Highway -> Data Ingest Topology -> per-tenant Topics -> aggregation split across Cluster 1 and Cluster 2 (rollup topology for all tenants; system and abuse topologies; isolation) -> UI Dashboard & Graphs
  48. Re-queue Pipeline – Solution for Write Stability: Kafka data queue (6 hrs) -> Kafka consumer -> TSDB async HBase lib -> HBase (UID lookups); failed writes (UID table unavailable, no response, NSRE) go to a re-queue queue (24 hrs)
     - Region splits & hotspots
     - NSREs & GCs
     - Region unresponsive
     - Region unavailability
     - Load rebalancing
     - Region queue size max-out
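     An illustrative sketch of the re-queue idea, using the newer Kafka Java clients rather than the 0.8 APIs of the talk; the topic names, consumer group, and TSDB write call are assumptions: consume from the data topic and, when a write fails, publish the point to a longer-retention re-queue topic instead of blocking or dropping.

      import java.time.Duration;
      import java.util.Collections;
      import java.util.Properties;
      import org.apache.kafka.clients.consumer.ConsumerRecord;
      import org.apache.kafka.clients.consumer.ConsumerRecords;
      import org.apache.kafka.clients.consumer.KafkaConsumer;
      import org.apache.kafka.clients.producer.KafkaProducer;
      import org.apache.kafka.clients.producer.ProducerRecord;

      public class RequeueingWriter {
          public static void main(String[] args) {
              Properties cprops = new Properties();
              cprops.put("bootstrap.servers", "broker1:9092");   // assumed broker list
              cprops.put("group.id", "tsdb-writer");
              cprops.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
              cprops.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

              Properties pprops = new Properties();
              pprops.put("bootstrap.servers", "broker1:9092");
              pprops.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
              pprops.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

              try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cprops);
                   KafkaProducer<String, String> requeue = new KafkaProducer<>(pprops)) {
                  consumer.subscribe(Collections.singletonList("metrics-data"));     // 6 hr topic
                  while (true) {
                      ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                      for (ConsumerRecord<String, String> r : records) {
                          try {
                              writeToTsdb(r.value());                                // hypothetical TSDB write
                          } catch (Exception writeFailure) {
                              // NSRE, GC pause, region split, etc.: park the point in the
                              // 24 hr re-queue topic and keep the main pipeline moving.
                              requeue.send(new ProducerRecord<>("metrics-requeue", r.key(), r.value()));
                          }
                      }
                  }
              }
          }

          private static void writeToTsdb(String point) throws Exception {
              // Placeholder for the async HBase / OpenTSDB write.
          }
      }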
  49. Lessons Learned (same list as slide 24)
  50. Auto Retries: the writer thread pool inserts in-flight RPCs into a Guava cache (the in-flight RPC queue) on their way to HBase
  51. Auto Retries: on success, the Netty thread pool evicts the written RPC from the cache
  52. Auto Retries: on failure, it retries the write to HBase by looking up the RPC in the cache
  53. Auto Retries: a callback on the Netty thread pool handles failed/success responses and is given the additional job of handling removed / expired (timed-out) RPC entries
  54. Auto Retries: for a timed-out (expired) RPC, the callback retries by putting it back in the cache
  55. Auto Retries: retrying from the expiry callback recurses and eventually causes a stack overflow
  56. Auto Retries: the recursion acquires the cache lock again and again (Lock, Lock, Lock, ...) until the stack overflows
  57. Auto Retries: during stack unwind, the Unlock call finds no space left on the stack and throws an exception
  58. Auto Retries: the unwind continues, with further Unlock calls throwing for the same reason
  59. Auto Retries: the unwind continues, with further Unlock calls throwing for the same reason
  60. Auto Retries: once the stack has unwound far enough, there is space to call Unlock again
  61. Auto Retries: hang-up, the thread dies
     - The thread is dead
     - 3 locks remain held
     - No thread can write/insert as the cache is locked
     - Guava cache hung, TSDB hung!!
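     A minimal sketch of the hazardous pattern (not the actual AsyncHBase/OpenTSDB code; the key type, timeout, and retry helper are assumptions): a Guava cache of in-flight RPCs whose removal listener re-inserts expired entries. Removal listeners run synchronously on the thread performing cache maintenance, so retries issued from the listener can cascade into the deep recursion, stack overflow, and stuck locks described above.

      import java.util.concurrent.TimeUnit;
      import com.google.common.cache.Cache;
      import com.google.common.cache.CacheBuilder;
      import com.google.common.cache.RemovalCause;
      import com.google.common.cache.RemovalListener;
      import com.google.common.cache.RemovalNotification;

      public class InflightRpcCache {

          // Placeholder for an in-flight HBase RPC.
          static final class PendingRpc {
              final String rowKey;
              PendingRpc(String rowKey) { this.rowKey = rowKey; }
          }

          private final Cache<String, PendingRpc> inflight;

          InflightRpcCache() {
              this.inflight = CacheBuilder.newBuilder()
                  .expireAfterWrite(30, TimeUnit.SECONDS)          // "timed-out" RPCs expire
                  .removalListener(new RemovalListener<String, PendingRpc>() {
                      @Override
                      public void onRemoval(RemovalNotification<String, PendingRpc> n) {
                          if (n.getCause() == RemovalCause.EXPIRED) {
                              // DANGER: this listener runs synchronously on the thread that
                              // triggered cache maintenance. Re-inserting here (the "retry")
                              // can trigger further expirations and recurse deeply, which at
                              // scale led to the stack overflow and hung locks in the slides.
                              retry(n.getKey(), n.getValue());
                          }
                      }
                  })
                  .build();
          }

          void track(String rpcId, PendingRpc rpc) {
              inflight.put(rpcId, rpc);
          }

          private void retry(String rpcId, PendingRpc rpc) {
              // Hypothetical retry: re-issue the HBase write and put the RPC back in the cache.
              inflight.put(rpcId, rpc);
          }
      }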
  62. Lessons Learned (same list as slide 24)
  63. Storm and Kafka – Broker Slowness: Central Collector (no spooling) -> HTTP POST -> Storm (spout with Jetty servlet -> bolt) -> Kafka brokers 1-3 (product topics, TSDB_1, TSDB_2)
     - The bolt thread writes to an in-memory Kafka queue asynchronously
     - During slowness of even one broker, if this queue fills up it blocks the producer bolt thread, which in turn back-pressures upstream
     - If we have no spooling, we lose the data even if the broker recovers; otherwise replay saves the day
     - Scale: 133 topologies, 15 topics per topology, 3 partitions per topic, 3 TSDB topics, 222 partitions per topic, 22 Kafka brokers
  64. Storm and Kafka – Broker Slowness (same as above). ✓ Better monitoring
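     The blocking behavior was a property of the old async producer's in-memory queue; a hedged sketch of the relevant 0.8-era settings (broker list and values are illustrative): queue.enqueue.timeout.ms decides whether a full queue blocks the bolt thread indefinitely (-1) or waits a bounded time and then drops, which mirrors the spooling/backpressure trade-off on the slide.

      import java.util.Properties;
      import kafka.javaapi.producer.Producer;
      import kafka.producer.ProducerConfig;

      public class BoundedAsyncProducer {
          public static Producer<String, String> create() {
              Properties props = new Properties();
              props.put("metadata.broker.list", "broker1:9092,broker2:9092"); // assumed brokers
              props.put("serializer.class", "kafka.serializer.StringEncoder");
              props.put("producer.type", "async");

              // Size of the in-memory queue the bolt thread writes into.
              props.put("queue.buffering.max.messages", "20000");

              // How long send() may block when that queue is full: -1 blocks forever
              // (back-pressuring the bolt, as in the incident); a bounded value keeps
              // the bolt thread alive at the cost of dropping events for the slow broker.
              props.put("queue.enqueue.timeout.ms", "500");

              return new Producer<>(new ProducerConfig(props));
          }
      }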
  65. Storm and Kafka – Broker Slowness: a Kafka broker's memory splits across the JVM heap (broker code, read variables, file handlers, writes from producers, metadata, partition information, topic information), the OS page cache, and disk; producers write and consumers read through the page cache
  66. Storm and Kafka – Broker Slowness: unused JVM heap contents get swapped to disk
  67. Storm and Kafka – Broker Slowness: the OS maximizes the page cache at the expense of the unused heap
  68. Storm and Kafka – Broker Slowness: when the swapped-out objects are needed again, they are swapped back from disk and GC kicks in for them
  69. Storm and Kafka – Broker Slowness: while that happens, writes stall; a high-RPS pipeline will see heavy backpressure and data will get dropped. The knob involved is vm.swappiness
  70. Lessons Learned (same list as slide 24)
  71. Thank You. @mridul_jain @sumeetksingh
