Lessons Learned Building A Scalable
Self-serve, Real-time, Multi-tenant
Monitoring Service
PRESENTED BY Mridul Jain, Sumeet Singh | March 31, 2016
Strata Conference + Hadoop World 2016, San Jose
Introduction
Mridul Jain
Senior Principal Architect
Big Data and Machine Learning
Science and Technology
701 First Avenue,
Sunnyvale, CA 94089 USA
@mridul_jain
§  Big ML at Yahoo
§  Has used Storm and Kafka for real-time trend analysis in search and central monitoring
§  Co-authored Pig on Storm
§  Co-authored CaffeOnSpark for distributed deep learning
Sumeet Singh
Sr. Director, Product Management
Cloud and Big Data Platforms
Science and Technology
701 First Avenue,
Sunnyvale, CA 94089 USA
@sumeetksingh
§  Manages Hadoop products team at Yahoo
§  Responsible for Product Management, Strategy and Customer Engagements
§  Managed Cloud Services products team and headed strategy functions for the Cloud Platform Group at Yahoo
§  MBA from UCLA and MS from Rensselaer Polytechnic Institute (RPI)
Acknowledgement
We want to acknowledge the contributions of Kapil Gupta and Arun Gupta,
Principal Architects with the Yahoo Monitoring team, to this presentation as well
as to the monitoring platform.
We would also like to thank the entire Yahoo Monitoring and Hadoop and
Big Data Platforms teams for making the next-generation monitoring services
a reality at Yahoo.
Agenda
1. Overview
2. Transitioning from Classical to Real-time Big Data Architecture
3. Lessons Learned Scaling the Real-time Big Data Stack
4. Lessons Learned Optimizing for System Performance
5. Q&A
Introduction to Yahoo’s Monitoring as a Service
[Diagram: Hosts and Apps feed a Hosted Multi-tenant Monitoring Service]
Infra Monitoring (Hosts): CPU, disk, network; Host uptime; HTTP sess. errors
App Monitoring (Apps): Req. per second; Avg. latency; API access errors
Hosted Multi-tenant Monitoring Service: Collection, Storage, Scheduling, Coordination, Alerts / Thresholds, Dashboards, Aggregation
Classical Architecture – Pre Real-time Big Data Tech
[Diagram components: Hosts, Collectors, Aggregators, DB Shards, Frontend / Query]
Hosts: 200,000
Collectors: 43
Aggregators: 60
DB Shards: 2,400
1. Large Fan-out
2. Manually Sharded DBs
3. Massive Query Federation
✗ Manageability Challenges
Classical Architecture – Pre Real-time Big Data Tech
[Diagram: hosts H1 to H5 (groups A and B) → Collector → Aggregator Server → DB Server → Dashboard]
1. Manual partitioning of hosts
2. Single-threaded aggregator per cluster; sequential processing of rules; 4M DP/min per aggregator
3. 1 shard per cluster; 1.5M DP/min
4. Sequential fetch for federated queries
✗ Scale Challenges ✗ Availability Challenges
Architecture Based on Real-time Big Data Tech
[Diagram: Hosts → Collectors → Data Highway → UI (Dashboard & Graphs)]
Standard Big Data Frameworks
§  No manual partitioning / sharding
§  Built-in horizontal scalability
§  Built-in high availability
✔ Manageability
✔ Scalability
✔ Availability
Scale and Performance
[Diagram: Data Highway → Data Ingest Topology → Topics (Tenant 1, Tenant 2, Tenant 3) → Aggregation Topologies (Tenant 1, Tenant 2, Tenant 3) → UI (Dashboard & Graphs)]
§  Low latency real-time processing
§  5x the scale of the previous architecture
§  Massive parallelism and pipelining
§  Real-time aggregation, thresholds and alerts
§  Support for larger historic data lookup & processing
§  Support for self-serve complex processing, data slicing and dicing
§  Pluggable algo and ML models (e.g. EGADS)
Self-serve deployment through Git and CI / CD:
§  Rules and config live in Git: /alert_policy/kpis.yaml, /contacts/oc.yaml, /rules/system.yo (e.g. A = Filter from * where host regex …)
§  Run semantic and syntactic validation CLI
§  Git commit, PR, Merge
§  Alerts to OC, correlators and mailing lists
✔ Self-serve Easy Deploys ✔ Real-time Alerting
Self Serve Rules
A = filter * where namespace == "product1" and application == "apache",60,3

B = filter * where namespace == "product2" and Tag.host in ("hostgrp1","hostgrp4")

C = threshold A Metric.monstatus.latency < 2 as "mycheck"

Store C

alert C, $LatencyAlertConfig, $NotificationID, LOW, $UrlID, $CustMessageID
§  Simple and rich processing language with custom UDF support for algos and statistical functions
§  Support for arithmetic, set, stats operators, groupby, joins etc.
§  Events from different namespaces can be combined
§  Thresholds, policies, notification contacts and severity are configured in a simple, hot-deployable fashion
§  Store relations and calculations as you like
§  Automatically track all the good, bad, and missing events
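To make the filter and threshold steps above more concrete, here is a minimal Python sketch of what rules A and C might do to a stream of events. It is not the service's rule engine: the `Event` fields, the helper names, and the reading of `threshold` as "label each event good or bad against a predicate" are all assumptions made for illustration, and the trailing `60,3` arguments of the filter are left out because the slides do not explain them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape; the real schema is not shown in the slides.
    namespace: str
    application: str
    host: str
    latency: float


def filter_events(events, **conditions):
    """Keep events whose attributes equal every given condition."""
    return [e for e in events
            if all(getattr(e, key) == value for key, value in conditions.items())]


def threshold(events, predicate, name):
    """Label each event 'good' or 'bad' under a named check (assumed semantics)."""
    return [(name, e, "good" if predicate(e) else "bad") for e in events]


events = [
    Event("product1", "apache", "h1", 1.2),
    Event("product1", "apache", "h2", 3.5),
    Event("product2", "nginx", "h3", 0.4),
]

# A = filter * where namespace == "product1" and application == "apache"
a = filter_events(events, namespace="product1", application="apache")

# C = threshold A Metric.monstatus.latency < 2 as "mycheck"
c = threshold(a, lambda e: e.latency < 2, "mycheck")
statuses = [status for _, _, status in c]
```

In this reading, `statuses` is what an alerting step would consume: one label per event that passed the filter.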
Lessons Learned
1. Producer-consumer problem at scale requires the right balance in architecture
2. Skewness in data is hard to debug
3. E2E multi-tenancy and resourcing should be handled strategically
4. Optimizations made in async systems are hard to debug
5. Do not neglect the assumptions/optimizations outside your application
Lesson 1: Producer-consumer problem at scale requires the right balance in architecture
Storm + Kafka Based Architecture
[Diagram: Central Collector (no spooling) → HTTP POST → Storm (Spout with Jetty Servlet → Bolt) → Kafka (topics for Product 1, Product 2, … Product N; TSDB_1, TSDB_2, TSDB_3)]
Scale of an Online Monitoring Solution
§  133 topics
§  400 bolt tasks in 40 workers
§  450 topologies
§  15 topics / topology
§  3 partitions / topic
§  3 TSDB topics
§  222 partitions per topic
A Producer - Consumer Pipeline
[Diagram: Data Highway → Data Ingest Topology → Topics (Tenant 1, Tenant 2, Tenant 3) → Aggregation Topologies (Tenant 1, Tenant 2, Tenant 3) → UI (Dashboard & Graphs)]
§  Excellent E2E synchronization
§  Provides a breather against individual component failures
§  Reasonably good performance in spite of transient failures
§  Can help individual components to scale, if used smartly
Monitoring Time Roll-ups
[Diagram: Kafka Cluster → Storm (Spout → Bolt), with in-memory state per topic]
Roll-ups in bolts:
§  Huge in-memory state
§  220 million/min * 60
§  Trident issues
§  High network → high CPU
Monitoring Time Roll-ups
[Diagram: Producer → Kafka Cluster → Storm (Spout), with in-memory state per topic]
Roll-ups in the spout:
§  Aggregate in Spout
§  220 million/min * 60
§  Fields grouping in Kafka for a time series
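The "fields grouping in Kafka" idea can be sketched as key-based partitioning: if every point of one time series lands on the same partition, whoever reads that partition can roll the series up from local state alone. The Python below is a small illustration under that assumption, not the authors' code. The function names, the CRC32 hash, the three-partition layout, and the reading of "* 60" as a minute-to-hour roll-up are all hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(series_key: str, num_partitions: int) -> int:
    """Stable hash so every point of one time series maps to one partition."""
    return zlib.crc32(series_key.encode("utf-8")) % num_partitions


def rollup_hourly(points):
    """Sum per-minute values into (series, hour) buckets held in local memory."""
    totals = defaultdict(float)
    for series_key, epoch_minute, value in points:
        totals[(series_key, epoch_minute // 60)] += value
    return dict(totals)


# Two hours of per-minute points for one series, plus a single point for another.
points = [("host1.cpu", minute, 1.0) for minute in range(120)]
points.append(("host2.cpu", 5, 2.0))

by_partition = defaultdict(list)
for point in points:
    by_partition[partition_for(point[0], 3)].append(point)

# Each partition reader aggregates independently; no cross-task shuffle is needed.
rollups = {pid: rollup_hourly(pts) for pid, pts in by_partition.items()}
```

Because the grouping happens before the data reaches Storm, the aggregation state for a series lives in exactly one place, which is what removes the network hop between spout and bolt in this sketch.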
Kafka Refresh
[Diagram: Kafka cluster with Broker 1, Broker 2 and Broker 3 hosting topic 1 to topic 6]
Each broker may host different topics, but every broker has metadata about every other broker in the cluster.
§  A producer contacts any broker to get the topic list across the cluster every 10 mins
§  Each topic fetch call has a timeout of 10 secs and is a blocking call on the main producer thread
§  If there are 100 topics and a broker is down (socket timeout), this blocks for 1000 s, which is longer than the next refresh cycle (10 mins)
§  Effectively hangs the producer
Disable refresh: if a broker is down, the producer APIs get the metadata from an alternate broker anyway.
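The arithmetic behind the hang is worth spelling out. The sketch below only models the numbers on the slide (sequential, blocking per-topic fetches that each run into the timeout); the function names are made up for illustration. In the legacy Kafka 0.8 producer the periodic refresh interval is the `topic.metadata.refresh.interval.ms` setting, whose default is 10 minutes; that this is the setting the authors disabled is an assumption.

```python
def worst_case_refresh_seconds(num_topics: int, fetch_timeout_s: float) -> float:
    """Sequential, blocking per-topic fetches that each hit the timeout."""
    return num_topics * fetch_timeout_s


def refresh_overruns(num_topics: int, fetch_timeout_s: float,
                     refresh_interval_s: float) -> bool:
    """True when one refresh pass outlasts the refresh interval itself."""
    return worst_case_refresh_seconds(num_topics, fetch_timeout_s) > refresh_interval_s


# Numbers from the slide: 100 topics, 10 s timeout, refresh every 10 minutes.
blocked_s = worst_case_refresh_seconds(100, 10)
overrun = refresh_overruns(100, 10, 10 * 60)
```

With these inputs a refresh pass takes 1000 s against a 600 s interval, so the next refresh is due before the current one finishes and the producer thread never gets back to sending.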
A Producer - Consumer Pipeline
[Diagram: Data Highway → Data Ingest Topology → Topics (Tenant 1, Tenant 2, Tenant 3) → Aggregation Topologies (Tenant 1, Tenant 2, Tenant 3) → UI (Dashboard & Graphs)]
§  Excellent E2E synchronization
§  Provides a breather against individual component failures
§  Reasonably good performance in spite of transient failures
§  Can help individual components to scale, if used smartly
§  Queuing system is your last line of defense, choose wisely
Lesson 2: Skewness in data is hard to debug
Skewed Ingestion per Task
[Diagram: Spout (22 M / min) → bolts A1, A2, A3 → bolts B1, B2]
A high rate of ingestion with a “Group By” on limited dimensions directs all
events for a specific dimension to one task.
Skewed Ingestion per Task
[Diagram: Spout (22 M / min) → Shuffle → bolts A1, A2, A3 (com 1, com 2, com 3) → Partition By → bolts B1, B2]
Each combiner maintains local state for each of the dimensions and
forwards the aggregated count to B1 or B2.
Overall state per task shrinks because the combiners share the original big state and
aggregate it before forwarding to the final bolts, which reduces the final bolts' state too.
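The combiner pattern above can be illustrated outside Storm with plain Python. This is a sketch of two-stage aggregation in general, not the topology's code: the round-robin shuffle, the CRC32 partitioner and the sample key distribution are assumptions chosen to show a hot key being spread across combiners before the final group-by.

```python
import zlib
from collections import Counter
from itertools import cycle


def shuffle_to_combiners(events, num_combiners):
    """Stage 1 routing: round-robin, so a hot key is spread over all combiners."""
    buckets = [[] for _ in range(num_combiners)]
    for bucket, event in zip(cycle(buckets), events):
        bucket.append(event)
    return buckets


def combine(bucket):
    """Each combiner keeps a local partial count per dimension value."""
    return Counter(bucket)


def final_aggregate(partials, num_final):
    """Stage 2: partition the partial counts by key and merge them."""
    finals = [Counter() for _ in range(num_final)]
    for partial in partials:
        for key, count in partial.items():
            finals[zlib.crc32(key.encode("utf-8")) % num_final][key] += count
    return finals


# A heavily skewed stream: one dimension value dominates.
events = ["us"] * 9000 + ["in"] * 900 + ["br"] * 100

buckets = shuffle_to_combiners(events, 3)
partials = [combine(bucket) for bucket in buckets]
finals = final_aggregate(partials, 2)

totals = sum(finals, Counter())
messages_to_final_stage = sum(len(partial) for partial in partials)
```

The final stage receives one partial count per key per combiner rather than one message per event, so the task that owns the hot key merges a handful of numbers instead of absorbing the whole stream.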
Abuse
§  Max ingestion per TSDB - 120k/s
§  UID table hit hard due to high cardinality data
§  Lots of in-memory states created in Storm bolts
Lesson 3: E2E multi-tenancy and resourcing should be handled strategically
ZooKeeper Scaling
[Diagram: Data Highway → Data Ingest Topology → Topics (Tenant 1, Tenant 2, Tenant 3) → Aggregation Topologies (Tenant 1, Tenant 2, Tenant 3) → UI (Dashboard & Graphs); labels: ZK - Storm, ZK - Kafka, Single Cluster for Agg.]
§  Kafka consumers swapping in and out create heavy churn in ZK state for Kafka brokers
§  Every time a consumer enters or leaves, all consumers query the group state from ZK
§  The same happens on rolling upgrades of Kafka, restarts, and any bad behaviour by consumers
Topology Scaling
[Diagram: Data Highway → Data Ingest Topology → Topics (Tenant 1, Tenant 2, Tenant 3) → Aggregation Topologies (Tenant 1, Tenant 2, Tenant 3) → UI (Dashboard & Graphs); label: Single Cluster for Agg.]
Trident Scaling
A = filter * where namespace == "ABC" and application == "XYZ",5,3
1 rule = 1 logical bolt
Trident accepts < 400 rules per topology: 400 logical Trident UDFs
§  Limited by ZooKeeper jute size
§  Tunable, but raising it leads to performance issues: Nimbus OOM, worker heartbeat slowness etc.
E.g. 1200 rules will need about 3 Trident topologies
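The rule-to-topology sizing is a ceiling division. The helper below is a hypothetical planning aid, not part of the platform; it treats 400 as the per-topology cap, which matches the slide's "about 3" for 1200 rules even though the stated limit is strictly below 400.

```python
import math


def topologies_needed(num_rules: int, max_rules_per_topology: int = 400) -> int:
    """Smallest number of topologies that keeps each one at or under the cap."""
    return math.ceil(num_rules / max_rules_per_topology)


def split_rules(rules, max_rules_per_topology: int = 400):
    """Chunk a rule list into per-topology groups, preserving order."""
    return [rules[i:i + max_rules_per_topology]
            for i in range(0, len(rules), max_rules_per_topology)]


rule_names = [f"rule_{i}" for i in range(1200)]  # placeholder rule identifiers
groups = split_rules(rule_names)
```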
Efficient Resourcing and Hardware Utilization
[Diagram: Data Highway → Data Ingest Topology → Topics (Tenant 1, Tenant 2, Tenant 3) → Cluster 1 and Cluster 2 → UI (Dashboard & Graphs); labels: Rollup topology - all tenants; System, Abuse topologies; Isolation]
Re-queue Pipeline – Solution for Write Stability
[Diagram: Kafka Data Queue (6 Hrs) and Requeue queue (24 Hrs) → Kafka consumer → TSDB → Async HBase lib → HBase; UID Lookups]
Failures seen on the write path: UID table unavailable, no response, NSRE (NotServingRegionException)
§  Region splits & hotspots
§  NSREs & GCs
§  Region unresponsive
§  Region unavailability
§  Load rebalancing
§  Region queue size max-out
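One way to picture the re-queue idea is a consumer that parks failed writes on a second queue instead of blocking the main one. The Python sketch below uses in-memory deques in place of the Kafka topics and a made-up `flaky_writer` in place of the TSDB/HBase write path; the exception type, the attempt limit and the drop-after-limit policy are all assumptions for illustration.

```python
from collections import deque


class TransientWriteError(Exception):
    """Stand-in for a storage failure such as an unavailable region."""


def drain(data_queue, requeue, write, max_attempts=3):
    """Write each record; park failures on the requeue instead of blocking."""
    written = []
    while data_queue:
        record, attempts = data_queue.popleft()
        try:
            write(record)
            written.append(record)
        except TransientWriteError:
            # Records that exhaust their attempts are dropped in this sketch.
            if attempts + 1 < max_attempts:
                requeue.append((record, attempts + 1))
    return written


def flaky_writer(fail_first_n):
    """Hypothetical writer that fails its first N calls, then succeeds."""
    state = {"calls": 0}

    def write(record):
        state["calls"] += 1
        if state["calls"] <= fail_first_n:
            raise TransientWriteError(record)

    return write


data_queue = deque([("dp1", 0), ("dp2", 0), ("dp3", 0)])
requeue = deque()
write = flaky_writer(fail_first_n=2)

first_pass = drain(data_queue, requeue, write)   # storage is unhealthy at first
second_pass = drain(requeue, deque(), write)     # storage has recovered
```

The main queue keeps moving during the outage, and the longer retention on the requeue gives the storage layer time to recover before the parked records are replayed.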
Lesson 4: Optimizations made in async systems are hard to debug
Auto Retries
[Diagram: Inserts → Writer Thread Pool → Guava Cache (in-flight RPC queue) → HBase; Netty Thread Pool handles responses and callbacks]
1. Success: the Netty thread evicts the written RPC from the cache.
2. Failure: the Netty thread retries the write to HBase by looking up the RPC in the cache.
3. Timed-out RPCs: the callback thread (failed/success) is given the additional job of handling the removed / expired entry. The retry puts it back in the cache.
4. Because the same thread also has the job of removing expired entries, each retry can trigger another retry: Stack Overflow!!
5. Every nested call has taken a lock (Lock, Lock, Lock, …). When the stack overflows there is no space in the stack to call Unlock, so an exception is thrown and the stack unwinds.
6. Once the stack has unwound to some extent, there is space to call Unlock again, but the frames that unwound before that point never released their locks.
7. Hangup!! The thread dies.
§  Thread is dead
§  3 locks remaining
§  No thread can write/insert as the cache is locked
§  Guava cache hung, TSDB hung!!
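The end state (a dead thread that still holds lock levels) can be reproduced in miniature with a reentrant lock. This Python sketch is an analogy, not Guava or TSDB code: `threading.RLock` stands in for a reentrant cache lock, the recursion depth of 9 and the 3 leaked frames are picked to mirror the slide, and the skipped releases model the unlock calls that could not run while the stack was exhausted.

```python
import threading

cache_lock = threading.RLock()  # stands in for a reentrant cache lock


def nested_retry(depth, leaked_frames):
    """Re-enter the lock once per nested retry; the deepest frames never unlock."""
    cache_lock.acquire()
    if depth > 1:
        nested_retry(depth - 1, leaked_frames)
    if depth > leaked_frames:
        cache_lock.release()
    # Frames with depth <= leaked_frames model the unlocks that were skipped.


worker = threading.Thread(target=nested_retry, args=(9, 3))
worker.start()
worker.join()  # the worker has exited while still holding 3 lock levels

# Any other writer now fails to take the lock, however long it waits.
writer_can_lock = cache_lock.acquire(timeout=0.2)
```

Here `writer_can_lock` comes back `False`: the lock still belongs to a thread that no longer exists. Releasing in a `finally` block and retrying from a loop or a queue rather than from inside the callback are the usual ways to avoid ending up in this state.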
Lesson 5: Do not neglect the assumptions/optimizations outside your application
Storm and Kafka – Broker Slowness
[Diagram: Central Collector (no spooling) → HTTP POST → Storm (Spout with Jetty Servlet → Bolt) → Kafka (Broker 1, Broker 2, Broker 3; topics for Product 1, Product 2, Product 3; TSDB_1, TSDB_2)]
§  Bolt thread writes to the in-memory Kafka queue asynchronously
§  During slowness of even one broker, if this queue fills up, it blocks the producer bolt thread, which in turn back-pressures upstream
§  If we have no spooling we lose the data even if the broker recovers; otherwise replay saves the day
§  133 topologies
§  15 topics per topology
§  3 partitions per topic
§  3 TSDB topics
§  222 partitions per topic
§  22 Kafka brokers
✓ Better Monitoring
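The blocking behaviour described above is what any bounded in-memory queue does when its consumer stalls. The sketch below uses Python's `queue.Queue` as a stand-in for the producer's in-memory buffer; the buffer size, timeout and `emit` helper are hypothetical, and no real Kafka client is involved.

```python
import queue

send_buffer = queue.Queue(maxsize=3)  # stands in for the producer's in-memory queue


def emit(record, timeout_s=0.05):
    """Return True if buffered; False if the caller stayed blocked until timeout."""
    try:
        send_buffer.put(record, timeout=timeout_s)
        return True
    except queue.Full:
        return False


# Nothing drains the buffer here, which models a broker that has stalled.
results = [emit(f"dp{i}") for i in range(5)]
```

The first three records are buffered and the rest block the caller until the timeout expires. In the pipeline on the slide that caller is the bolt thread, so the stall propagates upstream to the collector, which has no spool to fall back on.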
Storm and Kafka – Broker Slowness
[Diagram: Kafka broker host with JVM, OS page cache and disk; writes from producers and reads from consumers]
Kafka Broker:
§  broker code
§  read variables
§  file handles
§  writes from producers
§  metadata
§  partition information
§  topic information
1. To maximize the page cache, unused contents are swapped to disk.
2. Later the contents are swapped back from disk, and GC kicks in for swapped-out objects.
3. Writes stall: a high RPS pipeline will see heavy backpressure and data will get dropped.
Relevant kernel setting: vm.swappiness
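On Linux the current swappiness value is exposed as a small text file under `/proc`. The helper below is a hypothetical pre-flight check, not something from the talk; it only reads the value and makes no recommendation about what the number should be.

```python
from pathlib import Path
from typing import Optional


def read_swappiness(path: str = "/proc/sys/vm/swappiness") -> Optional[int]:
    """Return the kernel's vm.swappiness value, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


current = read_swappiness()  # None on systems without this /proc entry
```

A check like this could run on broker hosts as part of monitoring, so that a host with an unexpected setting is noticed before it shows up as broker slowness.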
Lessons Learned
1. Producer-consumer problem at scale requires the right balance in architecture
2. Skewness in data is hard to debug
3. E2E multi-tenancy and resourcing should be handled strategically
4. Optimizations made in async systems are hard to debug
5. Do not neglect the assumptions/optimizations outside your application
Thank You
@mridul_jain
@sumeetksingh

Architecting a next-generation data platformhadooparchbook
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardParis Data Engineers !
 

Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable self-serve, real-time, multitenant monitoring service at Yahoo

  • 1. Lessons Learned Building A Scalable Self-serve, Real-time, Multi-tenant Monitoring Service. Presented by Mridul Jain and Sumeet Singh, March 31, 2016, Strata Conference + Hadoop World 2016, San Jose
  • 2. Introduction. Mridul Jain (@mridul_jain), Senior Principal Architect, Big Data and Machine Learning, Science and Technology, 701 First Avenue, Sunnyvale, CA 94089 USA: Big ML at Yahoo; has used Storm and Kafka for real-time trend analysis in search and central monitoring; co-authored Pig on Storm; co-authored CaffeOnSpark for distributed deep learning. Sumeet Singh (@sumeetksingh), Sr. Director, Product Management, Cloud and Big Data Platforms, Science and Technology, 701 First Avenue, Sunnyvale, CA 94089 USA: manages Hadoop products team at Yahoo; responsible for Product Management, Strategy and Customer Engagements; managed Cloud Services products team and headed strategy functions for the Cloud Platform Group at Yahoo; MBA from UCLA and MS from Rensselaer Polytechnic Institute (RPI)
  • 3. Acknowledgement. We want to acknowledge the contributions of Kapil Gupta and Arun Gupta, Principal Architects with the Yahoo Monitoring team, to this presentation as well as to the monitoring platform. We would also like to thank the entire Yahoo Monitoring and Hadoop and Big Data Platforms teams for making the next-generation monitoring services a reality at Yahoo.
  • 4. Agenda: (1) Overview; (2) Transitioning from Classical to Real-time Big Data Architecture; (3) Lessons Learned Scaling the Real-time Big Data Stack; (4) Lessons Learned Optimizing for System Performance; (5) Q&A
  • 5. Introduction to Yahoo’s Monitoring as a Service. Infra Monitoring for hosts: CPU, disk, network; host uptime; HTTP sess. errors. App Monitoring for apps: req. per second; avg. latency; API access errors. Hosted Multi-tenant Monitoring Service: collection, storage, scheduling, coordination, alerts / thresholds, dashboards, aggregation
  • 6. Classical Architecture – Pre Real-time Big Data Tech: Hosts 200,000; Aggregators 60; DB Shards 2,400; Collectors 43; Frontend / Query
  • 7. Classical Architecture – Pre Real-time Big Data Tech: Hosts 200,000; Aggregators 60; DB Shards 2,400; Collectors 43; Frontend / Query. (1) Large fan-out
  • 8. Classical Architecture – Pre Real-time Big Data Tech: Hosts 200,000; Aggregators 60; DB Shards 2,400; Collectors 43; Frontend / Query. (1) Large fan-out; (2) manually sharded DBs
  • 9. Classical Architecture – Pre Real-time Big Data Tech: Hosts 200,000; Aggregators 60; DB Shards 2,400; Collectors 43; Frontend / Query. (1) Large fan-out; (2) manually sharded DBs; (3) massive query federation
  • 10. Classical Architecture – Pre Real-time Big Data Tech: Hosts 200,000; Aggregators 60; DB Shards 2,400; Collectors 43; Frontend / Query. (1) Large fan-out; (2) manually sharded DBs; (3) massive query federation. ✗ Manageability Challenges
  • 11. Classical Architecture – Pre Real-time Big Data Tech: H1, H2, H3, H4, H5; Collector; Aggregator Server; DB Server; Dashboard
  • 12. Classical Architecture – Pre Real-time Big Data Tech: H1, H2, H3, H4, H5; Collector; Aggregator Server; DB Server; Dashboard; A A A B B B. (1) Manual partitioning of hosts
  • 13. Classical Architecture – Pre Real-time Big Data Tech: H1, H2, H3, H4, H5; Collector; Aggregator Server; DB Server; Dashboard; A A A B B B. (1) Manual partitioning of hosts. (2) Single-threaded agg. / cluster; seq. processing of rules; 4M DP/min per agg.
  • 14. Classical Architecture – Pre Real-time Big Data Tech: H1, H2, H3, H4, H5; Collector; Aggregator Server; DB Server; Dashboard; A A A B B B. (1) Manual partitioning of hosts. (2) Single-threaded agg. / cluster; seq. processing of rules; 4M DP/min per agg. (3) 1 shard / cluster; 1.5M DP/min
  • 15. Classical Architecture – Pre Real-time Big Data Tech: H1, H2, H3, H4, H5; Collector; Aggregator Server; DB Server; Dashboard; A A A B B B. (1) Manual partitioning of hosts. (2) Single-threaded agg. / cluster; seq. processing of rules; 4M DP/min per agg. (3) 1 shard / cluster; 1.5M DP/min. (4) Seq. fetch for federated queries
  • 16. Classical Architecture – Pre Real-time Big Data Tech: H1, H2, H3, H4, H5; Collector; Aggregator Server; DB Server; Dashboard; A A A B B B. (1) Manual partitioning of hosts. (2) Single-threaded agg. / cluster; seq. processing of rules; 4M DP/min per agg. (3) 1 shard / cluster; 1.5M DP/min. (4) Seq. fetch for federated queries. ✗ Scale Challenges ✗ Availability Challenges
  • 17. Architecture Based on Real-time Big Data Tech: Hosts; Collectors; Data Highway; UI Dashboard & Graphs
  • 18. Architecture Based on Real-time Big Data Tech: Hosts; Collectors; Data Highway; UI Dashboard & Graphs. Standard Big Data Frameworks: no manual partitioning / sharding; built-in horizontal scalability; built-in high availability. ✔ Manageability ✔ Scalability ✔ Availability
  • 19. Scale and Performance 19 Data Highway Data Ingest Topology Tenant 1 Tenant 2 Tenant 3 Tenant 1 Tenant 2 Tenant 3 Topics Tenant 1 Tenant 2 Tenant 3 Aggregation Topologies UI Dashboard & Graphs
  • 20. Scale and Performance 20 Data Highway Data Ingest Topology Tenant 1 Tenant 2 Tenant 3 Tenant 1 Tenant 2 Tenant 3 Topics Tenant 1 Tenant 2 Tenant 3 Aggregation Topologies UI Dashboard & Graphs §  Low latency real-time processing §  5x scale than the previous architecture §  Massive parallelism and pipelining §  Real-time aggregation, thresholds and alerts §  Support for larger historic data lookup & processing §  Support for self-serve complex processing, data slicing and dicing §  Pluggable algo and ML models (e.g. EGADS)
  • 21. Run semantic and syntactic validation CLI Git commit, PR, Merge Scale and Performance 21 Data Highway Data Ingest Topology Tenant 1 Tenant 2 Tenant 3 Tenant 1 Tenant 2 Tenant 3 Topics Tenant 1 Tenant 2 Tenant 3 Aggregation Topologies UI Dashboard & Graphs Git CI / CD A = Filter from * where host regex … /alert_policy/kpis.yaml /contacts/oc.yaml /rules/system.yo Alerts to OC, correlators and mailing lists
  • 22. Scale and Performance
    (Diagram labels: Data Highway, Data Ingest Topology, per-tenant Topics for Tenants 1–3, per-tenant Aggregation Topologies, UI Dashboard & Graphs, CLI, Git, CI / CD)
    § Rules are written as: A = Filter from * where host regex …
    § Config files: /alert_policy/kpis.yaml, /contacts/oc.yaml, /rules/system.yo
    § CLI: run semantic and syntactic validation
    § Git: commit, PR, merge
    § Alerts go to OC, correlators and mailing lists
    ✔ Self-serve Easy Deploys ✔ Real-time Alerting
  • 23. Self Serve Rules
    A = filter * where namespace == "product1" and application == "apache",60,3
    B = filter * where namespace == "product2" and Tag.host in ("hostgrp1","hostgrp4")
    C = threshold A Metric.monstatus.latency < 2 as "mycheck"
    Store C
    alert C, $LatencyAlertConfig, $NotificationID, LOW, $UrlID, $CustMessageID
    § Simple and rich processing language with custom UDF support for algorithms and statistical functions
    § Support for arithmetic, set and stats operators, group-by, joins, etc.
    § Events from different namespaces can be combined
    § Thresholds and policies, notification contacts and severity are set in a simple, hot-deployable fashion
    § Store relations and calculations as you like
    § Automatically track all the good, bad, and missing events
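The filter-then-threshold flow of rules A and C can be pictured with a small Python sketch. It only illustrates how the rules read on the slide: the event layout (a dict with `namespace`, `application`, `host` and a `metrics` map) and both helper names are assumptions, the trailing `,60,3` parameters are left out because the slide does not explain them, and the slide does not say whether a matching condition means healthy or alerting.

```python
def filter_events(events, namespace, application):
    """Keep events of one namespace/application (the role of rule A)."""
    return [
        e for e in events
        if e["namespace"] == namespace and e["application"] == application
    ]


def threshold(events, metric, limit, name):
    """Evaluate `metric < limit` per event and label the result (the role of rule C)."""
    return [
        {"check": name, "host": e["host"], "matched": e["metrics"][metric] < limit}
        for e in events
    ]


# Hypothetical sample events; field names are illustrative only.
events = [
    {"namespace": "product1", "application": "apache", "host": "h1",
     "metrics": {"latency": 1.2}},
    {"namespace": "product1", "application": "apache", "host": "h2",
     "metrics": {"latency": 3.5}},
    {"namespace": "product2", "application": "apache", "host": "h3",
     "metrics": {"latency": 0.4}},
]

a = filter_events(events, "product1", "apache")
c = threshold(a, "latency", 2, "mycheck")
```

Here `c` holds one labelled result per event that passed the filter, which is roughly what a later store or alert step would consume.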
  • 24. Lessons Learned
    § 1. Producer-consumer problem at scale requires the right balance in architecture
    § 2. Skewness in data is hard to debug
    § 3. E2E multi-tenancy and resourcing should be handled strategically
    § 4. Optimizations made in async systems are hard to debug
    § 5. Do not neglect the assumptions/optimizations outside your application
  • 26. Storm + Kafka Based Architecture
    (Diagram labels: Central Collector (no spooling), HTTP POST, Storm: Spout with Jetty Servlet and Bolt, Kafka: Product 1, Product 2 … Product N, 133 topics)
  • 27. Scale of an Online Monitoring Solution
    (Same diagram, plus TSDB_1, TSDB_2, TSDB_3)
    § 400 bolt tasks in 40 workers
    § 450 topologies
    § 15 topics per topology
    § 3 partitions per topic
    § 3 TSDB topics
    § 222 partitions per topic
  • 29. A Producer - Consumer Pipeline
    (Diagram labels: Data Highway, Data Ingest Topology, per-tenant Topics for Tenants 1–3, per-tenant Aggregation Topologies, UI Dashboard & Graphs)
    § Excellent E2E synchronization
    § Provides a breather against individual component failures
    § Reasonably good performance in spite of transient failures
    § Can help individual components to scale, if used smartly
  • 30. Monitoring Time Roll-ups
    (Diagram labels: Kafka Cluster with topics, Storm Spout and Bolt, in-memory state per topic)
    § Huge in-memory state
    § 220 million/min * 60
    § Trident issues
    § High network → high CPU
  • 31. Monitoring Time Roll-ups
    (Diagram labels: Producer, Kafka Cluster with topics, Storm Spout, in-memory state per topic)
    § Aggregate in Spout
    § 220 million/min * 60
    § Fields grouping in Kafka for a time series
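A rough Python sketch of the two ideas on this slide: the size of the roll-up state, and "fields grouping in Kafka", read here as keying each data point by its time series so that every point of a series lands in the same partition and can be aggregated locally in the spout. Reading "* 60" as one hour of minutes is an assumption, and `partition_for` and `rollup` are hypothetical helpers, not the Kafka partitioner API.

```python
import zlib
from collections import defaultdict

POINTS_PER_MINUTE = 220_000_000
# Assumption: "* 60" means one hour of one-minute data held for the roll-up.
points_per_hour = POINTS_PER_MINUTE * 60


def partition_for(series_key: str, num_partitions: int) -> int:
    """Stable key-based partitioning: the same series always maps to the same partition."""
    return zlib.crc32(series_key.encode("utf-8")) % num_partitions


def rollup(points):
    """Aggregate (series_key, value) pairs read from ONE partition into (sum, count)."""
    state = defaultdict(lambda: [0.0, 0])
    for key, value in points:
        state[key][0] += value
        state[key][1] += 1
    return {key: (total, count) for key, (total, count) in state.items()}


hourly = rollup([("cpu.h1", 1.0), ("cpu.h1", 3.0), ("cpu.h2", 2.0)])
```

Because a series never spans partitions under this keying, no cross-task shuffle is needed to finish its roll-up, which is the network saving the slide points at.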
  • 37. Kafka Refresh
    (Diagram labels: Kafka cluster with Brokers 1–3 hosting topics 1–6)
    Each broker may host different topics, but every broker has metadata about every other broker in the cluster.
    § A producer contacts any broker every 10 minutes to get the topic list across the cluster
    § Each topic fetch call has a timeout of 10 seconds and is a blocking call on the main producer thread
    § With 100 topics and one broker down (socket timeout), the refresh blocks for 1000 s, which is longer than the next refresh cycle (10 minutes)
    § This effectively hangs the producer
    Fix: disable the refresh. If a broker is down, the producer APIs get the metadata from an alternate broker anyway.
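The arithmetic behind the hang, as a short Python check. The numbers are the ones on the slide; the function name is made up for illustration, and the model assumes every per-topic fetch runs sequentially and hits its full timeout.

```python
def worst_case_block_seconds(num_topics: int, per_topic_timeout_s: int) -> int:
    """Sequential blocking fetches: if every fetch times out, the waits add up."""
    return num_topics * per_topic_timeout_s


REFRESH_INTERVAL_S = 10 * 60  # refresh cycle of 10 minutes

blocked_s = worst_case_block_seconds(num_topics=100, per_topic_timeout_s=10)
overruns_next_refresh = blocked_s > REFRESH_INTERVAL_S
```

With 1000 s of blocking against a 600 s cycle, the next refresh is due before the current one finishes, so the producer thread never gets back to sending.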
  • 38. A Producer - Consumer Pipeline
    (Diagram labels: Data Highway, Data Ingest Topology, per-tenant Topics for Tenants 1–3, per-tenant Aggregation Topologies, UI Dashboard & Graphs)
    § Excellent E2E synchronization
    § Provides a breather against individual component failures
    § Reasonably good performance in spite of transient failures
    § Can help individual components to scale, if used smartly
    § The queuing system is your last line of defense; choose wisely
  • 40. Skewed Ingestion per Task
    (Diagram labels: Spout at 22 M/min, bolts A1–A3, bolts B1–B2)
    § A high rate of ingestion with a "Group By" on limited dimensions directs all events for a specific dimension to one task
  • 41. Skewed Ingestion per Task
    (Diagram labels: Spout at 22 M/min, Shuffle to combiners com 1–com 3, Partition By to bolts B1–B2)
    § Each combiner maintains local state for each dimension and forwards the aggregated count to B1 or B2
    § The state per task shrinks because the combiners share the original big state and aggregate it before forwarding, which also reduces the state held in the final bolts
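The combiner stage can be sketched as a two-stage count in plain Python. This is a toy model, not Storm code: round-robin assignment stands in for shuffle grouping, a CRC-based hash stands in for the "partition by" step, and all names are hypothetical.

```python
import zlib
from collections import Counter


def stable_hash(key: str) -> int:
    """Deterministic hash so a key always maps to the same final task."""
    return zlib.crc32(key.encode("utf-8"))


def two_stage_count(events, num_combiners, num_final_tasks):
    # Stage 1 (shuffle): spread events round-robin, ignoring the key,
    # so a hot key is split evenly across all combiners.
    shards = [[] for _ in range(num_combiners)]
    for i, key in enumerate(events):
        shards[i % num_combiners].append(key)
    partials = [Counter(shard) for shard in shards]  # local state per combiner

    # Stage 2 (partition by key): each final task merges the small
    # pre-aggregated counts for the keys it owns.
    finals = [Counter() for _ in range(num_final_tasks)]
    for partial in partials:
        for key, count in partial.items():
            finals[stable_hash(key) % num_final_tasks][key] += count
    return shards, partials, finals


# A skewed stream: 90% of events carry the same dimension value.
events = ["hostA"] * 9000 + ["hostB"] * 1000
shards, partials, finals = two_stage_count(events, num_combiners=3, num_final_tasks=2)
```

No combiner sees more than a third of the stream, and the final tasks receive a handful of partial counts instead of 10,000 raw events.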
  • 42. Abuse
    § Max ingestion per TSDB: 120k/s
    § UID table hit hard due to high-cardinality data
    § Lots of in-memory state created in Storm bolts
  • 44. ZooKeeper Scaling
    (Diagram labels: Data Highway, Data Ingest Topology, per-tenant Topics for Tenants 1–3, per-tenant Aggregation Topologies, UI Dashboard & Graphs, ZK - Storm, ZK - Kafka, Single Cluster for Agg.)
    § Kafka consumers swapping in and out create heavy churn in the ZK state for Kafka brokers
    § Every time a consumer enters or leaves, all consumers query the group state from ZK
    § The same happens on rolling upgrades of Kafka, restarts, and any bad behaviour by consumers
  • 45. Topology Scaling
    (Diagram labels: Data Highway, Data Ingest Topology, per-tenant Topics for Tenants 1–3, per-tenant Aggregation Topologies, UI Dashboard & Graphs, Single Cluster for Agg.)
  • 46. Trident Scaling
    A = filter * where namespace == "ABC" and application == "XYZ",5,3
    § 1 rule = 1 logical bolt
    § Trident accepts < 400 rules per topology: 400 logical Trident UDFs
    § ZooKeeper jute size
    § Tunable, but leads to performance issues: Nimbus OOM, worker heartbeat slowness, etc.
    § E.g. 1200 rules will need about 3 Trident topologies
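The "1200 rules need about 3 topologies" estimate amounts to chunking the rule list by the per-topology limit. A minimal Python sketch, treating 400 as the ceiling (the slide says "< 400", so a real deployment may want some headroom); `split_rules` is a hypothetical helper.

```python
def split_rules(rules, max_rules_per_topology=400):
    """Split a rule list into consecutive chunks, one chunk per Trident topology."""
    return [
        rules[i:i + max_rules_per_topology]
        for i in range(0, len(rules), max_rules_per_topology)
    ]


# Placeholder rule names; only the count matters here.
rules = [f"rule_{n}" for n in range(1200)]
topologies = split_rules(rules)
```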
  • 47. Efficient Resourcing and Hardware Utilization
    (Diagram labels: Data Highway, Data Ingest Topology, per-tenant Topics for Tenants 1–3, Cluster 1, Cluster 2, UI Dashboard & Graphs)
    § Rollup topology: all tenants
    § System and Abuse topologies
    § Isolation
  • 48. Re-queue Pipeline – Solution for Write Stability
    (Diagram labels: Kafka Data Queue (6 hrs) and Requeue queue (24 hrs), Kafka consumer, TSDB, Async HBase lib, HBase, UID lookups)
    Failure labels: UID table unavailable, no response, NSRE
    § Region splits & hotspots
    § NSREs & GCs
    § Region unresponsive
    § Region unavailability
    § Load rebalancing
    § Region queue size max-out
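One way to read the re-queue diagram is that a failed write is parked on a second, longer-retention queue and replayed later instead of being dropped. The Python sketch below models that reading with in-memory deques; the flow itself, `WriteFailed`, `drain` and the flaky writer are all assumptions made for illustration, not the team's implementation.

```python
from collections import deque


class WriteFailed(Exception):
    """Stand-in for a storage-side failure (UID table unavailable, no response, NSRE)."""


def drain(source, requeue, write):
    """Write every point from `source`; park failures on `requeue` instead of dropping them."""
    written = []
    while source:
        point = source.popleft()
        try:
            write(point)
            written.append(point)
        except WriteFailed:
            requeue.append(point)
    return written


def make_flaky_writer(fail_once):
    """Writer that fails the first attempt for the given points, then succeeds."""
    pending_failures = set(fail_once)

    def write(point):
        if point in pending_failures:
            pending_failures.discard(point)
            raise WriteFailed(point)

    return write


data_queue = deque(["dp1", "dp2", "dp3"])
requeue_queue = deque()
write = make_flaky_writer(fail_once={"dp2"})

first_pass = drain(data_queue, requeue_queue, write)   # dp2 is parked
second_pass = drain(requeue_queue, deque(), write)     # dp2 is replayed
```

The longer retention on the requeue side (24 hrs against 6 hrs on the slide) is what gives the storage layer time to recover before the replay.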
  • 50. Auto Retries
    (Diagram labels: Writer Thread Pool, Inserts, Guava Cache (in-flight RPC queue), HBase)
  • 51. Auto Retries
    § Success: the Netty thread pool evicts the written RPC from the cache
  • 52. Auto Retries
    § Failure: the write to HBase is retried by looking up the RPC in the cache
  • 53. Auto Retries
    § Timed-out RPCs: the failed/success callback is given the additional job of handling the removed / expired entry
  • 54. Auto Retries
    § Timed-out RPCs: the callback retries and puts the entry back in the cache
  • 55. Auto Retries
    § The callback, given the additional job of removing the expired entry, retries again and again: stack overflow!!
  • 56. Auto Retries
    § Stack overflow!! (Diagram: a stack of nested Lock calls, followed by the Response)
  • 57–59. Auto Retries
    § Stack unwind: there is no space in the stack to call Unlock, so it throws an exception (Diagram: the Unlock attempt moves down the stack of Lock calls)
  • 60. Auto Retries
    § Stack unwind: as the stack has unwound to some extent, there is now space to call Unlock
  • 61. Auto Retries
    § Hang-up!! The thread dies
    § Thread is dead
    § 3 locks remaining
    § No thread can write/insert as the cache is locked
    § Guava cache hung, TSDB hung!!
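The stack growth described above can be reproduced in miniature. The Python sketch below contrasts retrying from inside the callback, where every retry adds a stack frame, with queuing the retry for a loop to pick up, where stack depth stays constant. It illustrates the hazard only: it does not model Guava's locking, and the slides do not say which fix the team applied, so the queue-based variant is a suggestion, not their solution.

```python
import sys
from collections import deque


def retry_in_callback(attempts_left: int) -> int:
    """Each failed attempt retries from inside its own callback: one more stack frame per retry."""
    if attempts_left <= 1:
        return 1
    return 1 + retry_in_callback(attempts_left - 1)


def retry_from_queue(attempts_left: int) -> int:
    """Each failed attempt is queued and picked up by a loop: stack depth stays flat."""
    pending = deque([attempts_left])
    attempts_made = 0
    while pending:
        left = pending.popleft()
        attempts_made += 1
        if left > 1:  # pretend the write timed out again
            pending.append(left - 1)
    return attempts_made


many = sys.getrecursionlimit() + 100

try:
    retry_in_callback(many)
    overflowed = False
except RecursionError:
    overflowed = True

queued_attempts = retry_from_queue(many)
```

In the Java case on the slides the overflow is worse than a failed call, because the unlock calls that should run during unwinding also fail and leave the cache locked.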
  • 64. Storm and Kafka – Broker Slowness
    (Diagram labels: Central Collector (no spooling), HTTP POST, Storm: Spout with Jetty Servlet and Bolt, Kafka: Brokers 1–3, Product 1, Product 2, Product 3, TSDB_1, TSDB_2)
    § 133 topologies
    § 15 topics per topology
    § 3 partitions per topic
    § 3 TSDB topics
    § 222 partitions per topic
    § 22 Kafka brokers
    § The bolt thread writes asynchronously to an in-memory Kafka queue
    § If this queue fills up while even one broker is slow, it blocks the producer bolt thread, which in turn back-pressures upstream
    § Without spooling, the data is lost even if the broker recovers; with spooling, replay saves the day
    ✓ Better Monitoring
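The blocking behaviour of a full in-memory queue can be shown with Python's standard `queue.Queue` as a stand-in for the producer's buffer. This is not the Kafka producer API; the buffer size, timeout and `try_send` helper are made up for the illustration.

```python
import queue


def try_send(buffer: queue.Queue, record, timeout_s: float = 0.01) -> bool:
    """Enqueue a record; give up after `timeout_s` if the buffer stays full."""
    try:
        buffer.put(record, timeout=timeout_s)
        return True
    except queue.Full:
        return False


# Tiny buffer with nothing draining it, as if the broker had stalled.
buffer = queue.Queue(maxsize=2)
results = [try_send(buffer, f"metric-{i}") for i in range(3)]

# Once the consumer side drains one record, sends succeed again.
buffer.get()
after_drain = try_send(buffer, "metric-3")
```

A send that blocks without a timeout would stall the calling thread for as long as the broker stays slow, which is the back-pressure the slide describes; a send that gives up needs somewhere to spool the record, or the data is lost.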
  • 65. Storm and Kafka – Broker Slowness
    (Diagram labels: Kafka Broker, JVM, OS Page Cache, Disk, writes from producer, reads from consumer)
    § Broker code
    § Read variables
    § File handlers
    § Writes from producers
    § Metadata
    § Partition information
    § Topic information
  • 66. Storm and Kafka – Broker Slowness
    § Unused contents are swapped to disk
  • 67. Storm and Kafka – Broker Slowness
    § Maximize page cache (at the expense of unused contents)
  • 68. Storm and Kafka – Broker Slowness
    § Contents are swapped back from disk
    § GC kicks in for swapped-out objects
  • 69. Storm and Kafka – Broker Slowness
    § Contents are swapped back from disk
    § GC kicks in for swapped-out objects
    § Writes: a high-RPS pipeline will see heavy backpressure and data will get dropped
    § Tune vm.swappiness