1. ©2016 Pepperdata
Sean Suchter
CEO & Co-founder, Pepperdata
A Billion Points of Data Pressure
Best Practices to Process, Store, and Visualize Cluster Activity Data at Scale
2. ©2016 Pepperdata
AGENDA
• The deluge of data metrics
• Managing data pressure
• Architecture
• Scaling and optimizing data writes
• Optimizing queries
• Short-lived time series — performance and math challenges
• Performance stats vs. standard OpenTSDB deployments
3. ©2016 Pepperdata
EVEN A FEW NODES GENERATE MANY METRICS
Example cluster:
• 4,000 nodes
• Each running 40 tasks
• Each task has 200 different metrics (memory consumed, HDFS reads, CPU time, etc.)
Hadoop itself is a great example of a huge-scale microservice architecture!
If we sample every metric every 5 seconds, we generate about 400 million data points every minute (500 billion per day).
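As a sanity check on those figures, here is a small back-of-the-envelope calculation (a sketch that uses only the counts given on this slide):

public class DataPressure {
    public static void main(String[] args) {
        long nodes = 4_000L;
        long tasksPerNode = 40L;
        long metricsPerTask = 200L;
        long samplesPerMinute = 60 / 5;                        // one sample every 5 seconds

        long series = nodes * tasksPerNode * metricsPerTask;   // 32 million time series
        long pointsPerMinute = series * samplesPerMinute;      // ~384 million per minute
        long pointsPerDay = pointsPerMinute * 60 * 24;         // ~553 billion per day

        System.out.printf("series=%,d points/min=%,d points/day=%,d%n",
                series, pointsPerMinute, pointsPerDay);
    }
}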
[Diagram: node 1 through node M, each running task 1 through task P, each task reporting metric 1 through metric N.]
4. ©2016 Pepperdata
DATA PRESSURE IS LIKE WATER PRESSURE
Imagine the plumbing in your house:
• System over max capacity? You will get leaks!
• Reinforcing one component adds pressure elsewhere: EVERY component must be stable.
• Increase capacity or change behavior in one component and you WILL cause ripples throughout the system.
6. ©2016 Pepperdata
PEPPERDATA DASHBOARD ARCHITECTURE
[Architecture diagram: nodes in the Hadoop cluster (Node 1 … Node N) send data to the dashboard services, where node-level aggregation and insertion TSDs write into HBase; query TSDs and global aggregation sit behind a servlet, which serves browsers (JavaScript).]
7. ©2016 Pepperdata
SEPARATE TSDS
• It’s important to separate read TSDs and write TSDs because of their different thread access patterns.
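As an illustration of what this split can look like with stock OpenTSDB, the tsd.mode setting (available since OpenTSDB 2.1) lets a daemon run write-capable or read-only; the ports and values below are illustrative, not Pepperdata's actual configuration:

# opentsdb.conf for an insertion (write) TSD (illustrative)
tsd.network.port = 4242
tsd.mode = rw

# opentsdb.conf for a query (read-only) TSD (illustrative)
tsd.network.port = 4243
tsd.mode = ro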
8. ©2016 Pepperdata
HOW MANY TSDS?
• To handle more data you can add TSDs, but that increases pressure on HBase and HDFS.
• Solution: scale HBase and HDFS too.
• But you will still have problems until your data is well split, or when there are spikes on particular nodes.
• Solution: if a TSD, HBase, or HDFS gets overloaded, buffer data on the sender side until it can be sent (see the sketch below).
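A minimal sketch of sender-side buffering (the class and method names are hypothetical, not Pepperdata's code): the sender queues points in memory when the write path is overloaded and drains the queue once puts succeed again.

import java.util.ArrayDeque;
import java.util.Deque;

class BufferingSender {
    private final Deque<String> buffer = new ArrayDeque<>();
    private static final int MAX_BUFFERED = 1_000_000;

    // Called for every new data point line ("metric timestamp value tag=val ...").
    synchronized void send(String point) {
        buffer.addLast(point);
        if (buffer.size() > MAX_BUFFERED) {
            buffer.removeFirst();        // shed oldest data rather than exhaust sender memory
        }
        flush();
    }

    // Try to drain the buffer; on failure keep the data and retry on the next call.
    private void flush() {
        while (!buffer.isEmpty()) {
            String point = buffer.peekFirst();
            if (!tryPut(point)) {
                return;                  // TSD/HBase/HDFS overloaded: keep buffering
            }
            buffer.removeFirst();
        }
    }

    private boolean tryPut(String point) {
        // Write to the insertion TSD here (e.g. its telnet "put" or HTTP /api/put endpoint);
        // return false on timeout or error so the point stays buffered.
        return true;
    }
}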
9. ©2016 Pepperdata
REDUCE SERIALIZATION OVERHEAD
• Serialization of the input data into the TSD is a CPU bottleneck.
• Solution, part 1: bulk put (insert many points as one operation).
• Solution, part 2: an OpenTSDB plugin that moves metrics processing from the collector into the TSD process.
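For the bulk-put part, OpenTSDB's HTTP API accepts a JSON array of points in a single POST to /api/put, so many points are serialized and sent as one operation. A minimal sketch (host, port, metric names, and payload are illustrative):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class BulkPut {
    static void putBatch(String json) throws Exception {
        URL url = new URL("http://tsd-host:4242/api/put");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        if (conn.getResponseCode() >= 300) {
            throw new RuntimeException("bulk put failed: " + conn.getResponseCode());
        }
    }

    public static void main(String[] args) throws Exception {
        // Many points, one HTTP request, one serialization pass.
        putBatch("[" +
            "{\"metric\":\"task.cpu.ms\",\"timestamp\":1456123400,\"value\":812," +
             "\"tags\":{\"host\":\"node1\",\"task\":\"t42\"}}," +
            "{\"metric\":\"task.hdfs.read.bytes\",\"timestamp\":1456123400,\"value\":1048576," +
             "\"tags\":{\"host\":\"node1\",\"task\":\"t42\"}}" +
        "]");
    }
}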
10. ©2016 Pepperdata
SEGMENT THE QUERY
• Typically one query gets handled by one read TSD. But then that TSD runs out of memory servicing the query.
• Solution: segment the query, then put it back together at the servlet layer.
• But now the servlet layer has to hold the reassembled query data.
• Solution: two-phase queries: pick the top-N series to show in one pass, then redo the query to fetch the actual content for just those series.
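A rough sketch of how the two phases and the reassembly at the servlet layer might fit together (the types and helper methods are hypothetical stand-ins, not Pepperdata's actual servlet code):

import java.util.*;
import java.util.stream.Collectors;

class TwoPhaseQuery {
    // Phase 1: run a cheap, coarsely downsampled query and keep only the N series
    // with the largest totals, so the expensive pass touches far less data.
    static List<String> pickTopSeries(Map<String, Double> coarseTotalsBySeries, int n) {
        return coarseTotalsBySeries.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    // Phase 2: re-query only the chosen series, one time segment per read TSD,
    // then stitch the segments back together here at the servlet layer.
    static List<double[]> fetchAndStitch(List<String> series, List<long[]> timeSegments) {
        List<double[]> merged = new ArrayList<>();
        for (String s : series) {
            for (long[] segment : timeSegments) {
                merged.add(queryReadTsd(s, segment[0], segment[1]));
            }
        }
        return merged;
    }

    private static double[] queryReadTsd(String series, long startMs, long endMs) {
        return new double[0];   // placeholder for the real read-TSD query
    }
}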
11. ©2016 Pepperdata
PRE-COMPUTE COMMON AGGREGATES
• The raw data is too much to read off disk in HBase.
• Solution: pre-compute common aggregates before inserting each node's data into the TSD.
• Even the pre-aggregates can be too much on a large cluster.
• Solution: aggregate globally once enough node data has been inserted.
• Note: this is tricky because senders can have buffered data; that case needs to be handled correctly.
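A minimal sketch of the node-level pre-aggregation step (metric names and the exact grouping are illustrative): per-task values are folded into one node-level point per metric before anything is sent to the insertion TSD.

import java.util.HashMap;
import java.util.Map;

class NodeAggregator {
    // key: metric name; value: running sum across all tasks on this node for one sample period
    private final Map<String, Double> sums = new HashMap<>();

    void add(String metric, double taskValue) {
        sums.merge(metric, taskValue, Double::sum);
    }

    // Called once per sample period: emit one pre-aggregated point per metric,
    // tagged at node granularity, alongside (or instead of) the raw task points.
    Map<String, Double> flush() {
        Map<String, Double> nodeLevel = new HashMap<>(sums);
        sums.clear();
        return nodeLevel;
    }
}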
12. ©2016 Pepperdata
CACHING
• Queries are frequently over the same data.
• Solution: Cache the computed response in the TSD on disk.
• You need to invalidate the cache at the right moments, which can be tricky since different nodes and metrics may be ingested at different times.
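A sketch of what a disk-backed response cache inside a read TSD could look like (the path and the invalidation rule are illustrative; the real difficulty is deciding the "oldest safe ingest time" when nodes and metrics arrive at different times):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

class QueryCache {
    private final Path dir = Paths.get("/var/cache/tsd-queries");

    String get(String queryKey, long oldestSafeIngestMillis) throws IOException {
        Path f = dir.resolve(Integer.toHexString(queryKey.hashCode()));
        if (!Files.exists(f)) return null;
        // Invalidate if anything relevant was ingested after this response was written.
        if (Files.getLastModifiedTime(f).toMillis() < oldestSafeIngestMillis) {
            Files.delete(f);
            return null;
        }
        return new String(Files.readAllBytes(f), StandardCharsets.UTF_8);
    }

    void put(String queryKey, String response) throws IOException {
        Files.createDirectories(dir);
        Path f = dir.resolve(Integer.toHexString(queryKey.hashCode()));
        Files.write(f, response.getBytes(StandardCharsets.UTF_8));
    }
}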
13. ©2016 Pepperdata
SHORT-LIVED TIME SERIES: STORAGE CHALLENGE
• Container time series start and stop a lot (millions of times per hour), which can be inefficient for storage.
• Solution: pick the tag schema very carefully.
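To make the tag-schema point concrete, here is an illustrative contrast (metric and tag names are hypothetical; OpenTSDB assigns a UID to every distinct metric name, tag key, and tag value, so where a unique container ID lands determines how fast the UID tables grow):

class TagSchema {
    // Risky: a unique container ID baked into the metric name means one new metric UID
    // per container, millions of new metric names per hour.
    static String perContainerMetric(String containerId, long ts, double v) {
        return "put container." + containerId + ".cpu.ms " + ts + " " + v + " host=node1";
    }

    // Better behaved: a stable metric name, with container identity carried in a tag whose
    // cardinality you have consciously decided to pay for (or aggregate away on insert).
    static String stableMetric(String containerId, long ts, double v) {
        return "put task.cpu.ms " + ts + " " + v + " host=node1 container=" + containerId;
    }
}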
14. ©2016 Pepperdata
SHORT-LIVED TIME SERIES: MATH CHALLENGES
• Short-lived time series break typical OpenTSDB math:
• The standard OpenTSDB case of node-level time series can have discontinuities around node starts and stops, and that’s fine, since you only see a few of those starts and stops in any given plot.
• With containers, you’re likely to get hundreds of thousands or millions of discontinuities on every plot!
• Aggregation, downsampling, and rate metrics need to be computed very carefully.
• Solution: completely rewrite the math layers to take this into account.
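A simplified sketch of what discontinuity-aware rate math can look like (a stand-in illustration, not OpenTSDB's built-in rate or Pepperdata's actual math layer): rates are only computed between consecutive samples of the same series that are close together in time, so a container starting or stopping never manufactures a spike.

class SafeRate {
    // ts: sample times in seconds; v: a monotonically increasing counter for ONE series.
    static double[] rate(long[] ts, double[] v, long maxGapSeconds) {
        double[] out = new double[ts.length];
        for (int i = 0; i < ts.length; i++) {
            if (i == 0) {
                out[i] = Double.NaN;                 // no rate for the first sample of a series
                continue;
            }
            long dt = ts[i] - ts[i - 1];
            if (dt <= 0 || dt > maxGapSeconds || v[i] < v[i - 1]) {
                out[i] = Double.NaN;                 // restart or too-large gap: do not bridge it
            } else {
                out[i] = (v[i] - v[i - 1]) / dt;
            }
        }
        return out;
    }
}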
15. ©2016 Pepperdata
SOME PERFORMANCE STATS
• One example production deployment: 4,000 nodes × 40 tasks/node × 200 metrics/task, sampled every 5 seconds, producing about 400 million points/minute.
• We can render plots of a full day of this data in ~1 second.
16. ©2016 Pepperdata
PERFORMANCE VS. STANDARD OPENTSDB
http://www.slideshare.net/HBaseCon/ecosystem-session-6
http://www.slideshare.net/HBaseCon/operations-session-3-49043534
[Bar chart: data points/second per machine, on a scale from 0 to 250,000, for Arista, OVH, Limelight, Pinterest, Box, Ticketmaster, Yahoo, and Pepperdata.]
17. ©2016 Pepperdata
PERFORMANCE VS. STANDARD OPENTSDB
http://opentsdb.net/misc/opentsdb-oscon.pdf
We process and store ~600 billion data points per day from a single Hadoop cluster.
Editor's Notes: Data points are not just time stamped; they are time series.