This document discusses lessons learned from building a scalable, self-serve, real-time, multi-tenant monitoring service at Yahoo. It describes transitioning from a classical architecture to one based on real-time big data technologies like Storm and Kafka. Key lessons include properly handling producer-consumer problems at scale, challenges of debugging skewed data, strategically managing multi-tenancy and resources, issues optimizing asynchronous systems, and not neglecting assumptions outside the application.
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable self-serve, real-time, multitenant monitoring service at Yahoo
1. Lessons Learned Building A Scalable
Self-serve, Real-time, Multi-tenant
Monitoring Service
PRESENTED BY Mridul Jain, Sumeet Singh⎪ March 31, 2016
Strata Conference + Hadoop World 2016, San Jose
2. Introduction
Mridul Jain
Senior Principal Architect, Big Data and Machine Learning
Science and Technology
701 First Avenue, Sunnyvale, CA 94089 USA
@mridul_jain
§ Big ML at Yahoo
§ Has used Storm and Kafka for real-time trend analysis in search and central monitoring
§ Co-authored Pig on Storm
§ Co-authored CaffeOnSpark for distributed deep learning

Sumeet Singh
Sr. Director, Product Management, Cloud and Big Data Platforms
Science and Technology
701 First Avenue, Sunnyvale, CA 94089 USA
@sumeetksingh
§ Manages the Hadoop products team at Yahoo
§ Responsible for Product Management, Strategy and Customer Engagements
§ Managed the Cloud Services products team and headed strategy functions for the Cloud Platform Group at Yahoo
§ MBA from UCLA and MS from Rensselaer Polytechnic Institute (RPI)
3. Acknowledgement
We want to acknowledge the contributions of Kapil Gupta and Arun Gupta, Principal Architects with the Yahoo Monitoring team, to this presentation as well as to the monitoring platform. We would also like to thank the entire Yahoo Monitoring and Hadoop and Big Data Platforms teams for making the next-generation monitoring services a reality at Yahoo.
4. Agenda
1. Overview
2. Transitioning from Classical to Real-time Big Data Architecture
3. Lessons Learned Scaling the Real-time Big Data Stack
4. Lessons Learned Optimizing for System Performance
5. Q&A
5. Introduction to Yahoo’s Monitoring as a Service
Hosts → Infra Monitoring: CPU, disk, network; host uptime; HTTP sess. errors
Apps → App Monitoring: req. per second; avg. latency; API access errors

Hosted multi-tenant monitoring service: collection, storage, scheduling, coordination, alerts/thresholds, aggregation, dashboards
6. Classical Architecture – Pre Real-time Big Data Tech

Hosts (200,000) → Collectors (43) → Aggregators (60) → DB Shards (2,400) → Frontend / Query

1. Large fan-out
2. Manually sharded DBs
3. Massive query federation

✗ Manageability Challenges
11. Classical Architecture – Pre Real-time Big Data Tech

Hosts (H1–H5) → Collector → Aggregator Server → DB Server → Dashboard

1. Manual partitioning of hosts
2. Single-threaded aggregator per cluster; sequential processing of rules; 4M DP/min per aggregator
3. 1 shard per cluster; 1.5M DP/min
4. Sequential fetch for federated queries

✗ Scale Challenges ✗ Availability Challenges
17. Architecture Based on Real-time Big Data Tech

Hosts → Collectors → Data Highway → UI Dashboards & Graphs

Standard big data frameworks:
§ No manual partitioning / sharding
§ Built-in horizontal scalability
§ Built-in high availability

✔ Manageability ✔ Scalability ✔ Availability
24. Lessons Learned

1. Producer-consumer problem at scale requires the right balance in architecture
2. Skewness in data is hard to debug
3. E2E multi-tenancy and resourcing should be handled strategically
4. Optimizations made in async systems are hard to debug
5. Do not neglect the assumptions/optimizations outside your application
26. Storm + Kafka Based Architecture

HTTP POST → Central Collector (no spooling) → Spout with Jetty Servlet → Bolt (Storm) → Kafka
Product 1, Product 2, … Product N publish across 133 topics

27. Scale of an Online Monitoring Solution

§ 450 topologies
§ 15 topics per topology
§ 3 partitions per topic
§ 400 bolt tasks in 40 workers
§ 3 TSDB topics (TSDB_1, TSDB_2, TSDB_3)
§ 222 partitions per TSDB topic
29. A Producer - Consumer Pipeline

Data Highway → Data Ingest Topology → Topics (Tenant 1, 2, 3) → Aggregation Topologies → UI Dashboards & Graphs

§ Excellent E2E synchronization
§ Provides a breather against individual component failures
§ Reasonably good performance in spite of transient failures
§ Can help individual components scale, if used smartly
30. Monitoring Time Roll-ups

Kafka Cluster → Spout → Bolt (per-topic in-memory state) in Storm

§ Huge in-memory state: 220 million/min × 60
§ Trident issues
§ High network → high CPU

31. Monitoring Time Roll-ups

Producer → Kafka Cluster → Spout (per-topic in-memory state) in Storm

§ Aggregate in the spout instead of the bolt
§ 220 million/min × 60
§ Fields grouping in Kafka for a time series
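The in-spout roll-up depends on fields grouping: routing by the series key so every point of one time series lands on the same task, which can then aggregate locally. A minimal sketch in plain Python (the keying scheme is an assumption for illustration, not Storm's or Kafka's actual API):

```python
from collections import defaultdict

def fields_group(points, n_tasks):
    """Route each (metric, host, value) point by its series key."""
    tasks = defaultdict(list)
    for metric, host, value in points:
        key = (metric, host)                      # identity of the time series
        tasks[hash(key) % n_tasks].append(value)  # same series -> same task
    return tasks

points = [("cpu", "h1", 10), ("cpu", "h1", 30), ("cpu", "h2", 50)]
tasks = fields_group(points, n_tasks=4)
# Both ("cpu", "h1") points share one task, so a per-minute roll-up (sum,
# avg, ...) can be computed locally there with no cross-task state.
```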
32. Kafka Refresh

Kafka cluster: Brokers 1–3 hosting topics 1–6 among them. Each broker may host different topics, but every broker has metadata about every other broker in the cluster.

§ A producer contacts any broker to get the topic list across the cluster every 10 mins
§ Each topic fetch call has a 10-sec timeout and is a blocking call on the main producer thread
§ If there are 100 topics and a broker is down (socket timeout), the refresh blocks for 1,000s, longer than the next refresh cycle (10 mins)
§ Effectively hangs the producer

Fix: disable the refresh. If a broker is down anyway, the producer APIs get the metadata from an alternate broker.
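The arithmetic behind this failure mode is worth spelling out. A back-of-the-envelope sketch with the constants from the slide (`worst_case_refresh_time` is a hypothetical helper, not a Kafka API):

```python
REFRESH_INTERVAL_S = 600   # producer refreshes topic metadata every 10 min
FETCH_TIMEOUT_S = 10       # each per-topic fetch blocks up to 10 s on the main thread

def worst_case_refresh_time(num_topics):
    """Sequential blocking fetches: one full timeout per topic when a broker is down."""
    return num_topics * FETCH_TIMEOUT_S

stall = worst_case_refresh_time(100)
print(stall)                       # 1000 seconds of blocking
print(stall > REFRESH_INTERVAL_S)  # True: the refresh outlives its own cycle,
                                   # so the producer never stops refreshing
```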
38. A Producer - Consumer Pipeline

Data Highway → Data Ingest Topology → Topics (Tenant 1, 2, 3) → Aggregation Topologies → UI Dashboards & Graphs

§ Excellent E2E synchronization
§ Provides a breather against individual component failures
§ Reasonably good performance in spite of transient failures
§ Can help individual components scale, if used smartly
§ The queuing system is your last line of defense; choose wisely
39. Lessons Learned

1. Producer-consumer problem at scale requires the right balance in architecture
2. Skewness in data is hard to debug
3. E2E multi-tenancy and resourcing should be handled strategically
4. Optimizations made in async systems are hard to debug
5. Do not neglect the assumptions/optimizations outside your application
40. Skewed Ingestion per Task

Spout (22 M/min) → bolts A1, A2, A3, B1, B2

A high rate of ingestion with a "Group By" on limited dimensions directs all events for a specific dimension to one task.

41. Skewed Ingestion per Task

Spout → shuffle → combiners (com 1, com 2, com 3) → partition by → bolts B1, B2

Each combiner maintains local state for each dimension and forwards the aggregated count to B1 or B2. Overall state per task shrinks because the combiners share the original big state and aggregate it before forwarding to the final bolts.
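The combiner idea can be sketched in plain Python (Counters stand in for bolt state; none of this is Storm's API): the shuffle breaks the hotspot, local combining shrinks state, and the final partition-by sees only pre-aggregated records.

```python
from collections import Counter, defaultdict

def shuffle(events, n_combiners):
    """Round-robin events across combiners regardless of key, breaking the hotspot."""
    buckets = defaultdict(list)
    for i, event in enumerate(events):
        buckets[i % n_combiners].append(event)
    return buckets

def combine(bucket):
    """Local pre-aggregation: at most one partial count per dimension per combiner."""
    partial = Counter()
    for dim, count in bucket:
        partial[dim] += count
    return partial

def final_aggregate(partials):
    """Partition-by dimension: each final bolt merges a handful of partials."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

events = [("dimA", 1)] * 1000 + [("dimB", 1)] * 10   # heavily skewed toward dimA
buckets = shuffle(events, n_combiners=3)
totals = final_aggregate(combine(b) for b in buckets.values())
print(totals["dimA"], totals["dimB"])   # 1000 10, from at most 3 records per dimension
```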
42. Abuse

§ Max ingestion per TSDB: 120k/s
§ UID table hit hard due to high-cardinality data
§ Lots of in-memory state created in Storm bolts
43. Lessons Learned

1. Producer-consumer problem at scale requires the right balance in architecture
2. Skewness in data is hard to debug
3. E2E multi-tenancy and resourcing should be handled strategically
4. Optimizations made in async systems are hard to debug
5. Do not neglect the assumptions/optimizations outside your application
44. ZooKeeper Scaling

Data Highway → Data Ingest Topology → Topics (Tenant 1, 2, 3) → Aggregation Topologies (single cluster for aggregation) → UI Dashboards & Graphs
ZK - Storm, ZK - Kafka

§ Kafka consumers swapping in and out create heavy churn in the ZK state for Kafka brokers
§ Every time a consumer enters or leaves, all consumers query the group state from ZK
§ The same happens on Kafka rolling upgrades, restarts, and any bad behavior by consumers
48. Re-queue Pipeline – Solution for Write Stability

Kafka data queue (6 hrs) → Kafka consumer → TSDB → async HBase lib → HBase (UID lookups)
Failed writes (UID table unavailable, no response, NSRE) go to a Kafka re-queue queue (24 hrs) for replay

§ Region splits & hotspots
§ NSREs & GCs
§ Region unresponsive
§ Region unavailability
§ Load rebalancing
§ Region queue size max-out
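The re-queue pattern itself is simple. A hedged sketch (`write_tsdb` and `publish_requeue` are stand-ins for the real TSDB writer and Kafka producer; the pattern, not the APIs, is the point):

```python
def process(record, write_tsdb, publish_requeue):
    """Write a record; on failure, park it on the 24-hr re-queue topic for replay."""
    try:
        write_tsdb(record)        # may fail on NSREs, region moves, GC pauses
        return "written"
    except Exception:
        publish_requeue(record)   # longer-retention topic; a separate consumer replays it
        return "requeued"

requeue = []
healthy = lambda r: None
def failing(r):
    raise IOError("NoSuchRegionException")

print(process({"metric": "cpu"}, healthy, requeue.append))   # written
print(process({"metric": "cpu"}, failing, requeue.append))   # requeued
print(len(requeue))                                          # 1
```

The write path stays non-blocking: a transiently failing region costs one extra produce call instead of a stalled consumer.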
49. Lessons Learned

1. Producer-consumer problem at scale requires the right balance in architecture
2. Skewness in data is hard to debug
3. E2E multi-tenancy and resourcing should be handled strategically
4. Optimizations made in async systems are hard to debug
5. Do not neglect the assumptions/optimizations outside your application
53. Auto Retries

A writer thread pool inserts RPCs into a Guava cache (the in-flight RPC queue); a Netty thread pool runs the failed/success callbacks against HBase. Timed-out RPCs expire from the cache, and the removal listener, given the additional job of handling the removed/expired entry, retries by putting it back in the cache.

§ The retry from inside the removal listener recurses until the stack overflows
§ During the stack unwind, each frame still holds a cache lock; calling Unlock needs stack space, and with none left it throws another exception
§ Once the stack has unwound to some extent there is space to call Unlock again, but by then the thread dies with locks still held (3 in our case)
§ No thread can write/insert because the cache is locked: the Guava cache hangs, and TSDB hangs with it
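The failure can be reproduced in miniature. A plain-Python sketch (a dict stands in for the Guava cache, and the depth cap stands in for the JVM's stack limit; nothing here is Guava's actual API):

```python
class TinyCache:
    """Toy cache whose removal listener runs synchronously on the caller's
    stack, like a same-thread removal notification."""
    def __init__(self, on_removal):
        self.store = {}
        self.on_removal = on_removal

    def put(self, key, value):
        self.store[key] = value

    def expire(self, key):
        value = self.store.pop(key)
        self.on_removal(key, value)   # listener frame piles onto this stack

depth = 0
def retry_listener(key, value):
    """The 'additional job': put the expired RPC back, where it times out again."""
    global depth
    depth += 1
    if depth > 50:                    # stand-in for exhausting the real stack
        raise RecursionError("retry loop in removal listener blew the stack")
    cache.put(key, value)
    cache.expire(key)                 # recurses straight back into retry_listener

cache = TinyCache(on_removal=retry_listener)
cache.put("rpc-1", "payload")
try:
    cache.expire("rpc-1")
except RecursionError as e:
    print(e)   # in the real bug, every frame on the way down held a cache lock
```

The lesson generalizes: never re-enter a cache (or any lock-holding structure) from its own eviction callback; hand the retry to a separate queue or executor instead.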
62. Lessons Learned

1. Producer-consumer problem at scale requires the right balance in architecture
2. Skewness in data is hard to debug
3. E2E multi-tenancy and resourcing should be handled strategically
4. Optimizations made in async systems are hard to debug
5. Do not neglect the assumptions/optimizations outside your application
63. Storm and Kafka – Broker Slowness

HTTP POST → Central Collector (no spooling) → Spout with Jetty Servlet → Bolt (Storm) → Kafka (Brokers 1–3: Product 1, Product 2, Product 3, TSDB_1, TSDB_2)

§ The bolt thread writes to an in-memory Kafka queue asynchronously
§ During slowness of even one broker, if this queue fills up it blocks the producer bolt thread, which in turn back-pressures upstream
§ 133 topologies; 15 topics per topology; 3 partitions per topic
§ 3 TSDB topics; 222 partitions per topic; 22 Kafka brokers
§ With no spooling we lose the data even if the broker recovers; otherwise, replay saves the day

✓ Better Monitoring
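The bounded in-memory producer buffer behaves like a small queue. A deterministic sketch (the queue size and event count are illustrative, not the client's real defaults):

```python
import queue

buf = queue.Queue(maxsize=3)   # producer's in-memory buffer, tiny for the demo
dropped = 0
for i in range(10):            # one slow broker means nothing drains the buffer
    try:
        buf.put_nowait(i)      # the real bolt thread blocks here instead,
    except queue.Full:         # back-pressuring upstream; without spooling,
        dropped += 1           # overflow events are simply lost
print(buf.qsize(), dropped)    # 3 7
```

Whether the surplus blocks (back-pressure) or drops (data loss) is the spooling decision the slide describes; either way, one slow broker is enough to trigger it.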
65. Storm and Kafka – Broker Slowness

A Kafka broker's JVM holds broker code, read variables, file handlers, writes from producers, and metadata (partition information, topic information). Producers write and consumers read through the OS page cache, which sits in front of disk.

§ To maximize the page cache, the OS swaps unused JVM heap contents out to disk
§ When those objects are touched again they are swapped back in from disk, and GC kicks in for the swapped-out objects
§ Writes stall; a high-RPS pipeline sees heavy back-pressure and data gets dropped

Fix: tune vm.swappiness (lower it so the kernel prefers reclaiming page cache over swapping out the JVM heap).
70. Lessons Learned

1. Producer-consumer problem at scale requires the right balance in architecture
2. Skewness in data is hard to debug
3. E2E multi-tenancy and resourcing should be handled strategically
4. Optimizations made in async systems are hard to debug
5. Do not neglect the assumptions/optimizations outside your application