Scaling Pinterest’s Monitoring
Brian Overstreet - Visibility Software Engineer
Monitorama
Agenda
• What is Pinterest?
• Starting from Scratch
• Scaling the Monitoring System
• Focused on time series metrics
• Challenges faced
• The Missing Element
• Lessons Learned
• Summary
75+ Billion Ideas categorized by people into more than 1 Billion Boards
[Chart: Pinterest Unique Visitors (millions), y-axis 0 to 40, Jan 2011 to Jan 2013. Source: comScore]
Tools
• Ganglia (system metrics)
• No application metrics
• Up/Down Checks
Early 2012
From Bad to Worse
Lots of Outages
Monitoring* Timeline
Time Series Tools
[Timeline, 2010 to 2016. Events shown: Pinterest Launched; Ganglia for system metrics; Graphite Deployed]
*The action of observing and checking the behavior and outputs of a system and its components over time.
First Graphite Architecture
Single Box, Early 2012
[Diagram: Application sends metrics over the statsd UDP protocol to a single Metrics Box running statsd-server, carbon-cache, and graphite-web]
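To make the "statsd UDP protocol" arrow concrete, here is a minimal sketch of what a client puts on the wire. It uses the commonly documented StatsD line format (`name:value|type`, with an optional `|@rate` suffix). The metric names and the 127.0.0.1:8125 address are assumptions for illustration, not details from the talk.

```python
import socket


def format_statsd_line(name, value, metric_type, sample_rate=1.0):
    """Build one StatsD datagram payload, e.g. 'api.requests:1|c'."""
    line = f"{name}:{value}|{metric_type}"
    if sample_rate < 1.0:
        line += f"|@{sample_rate}"
    return line


def send_metric(sock, address, line):
    """Fire-and-forget send: UDP gives no delivery acknowledgement."""
    sock.sendto(line.encode("ascii"), address)


if __name__ == "__main__":
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Assumed address; 8125 is the conventional StatsD port.
    target = ("127.0.0.1", 8125)
    send_metric(sock, target, format_statsd_line("api.requests", 1, "c"))
    send_metric(sock, target, format_statsd_line("api.latency", 42, "ms"))
    sock.close()
```

Because the sender never learns whether a datagram arrived, an overloaded statsd server drops points silently, which is the failure mode the later slides describe.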
Second Graphite Architecture
Clustered — Early 2013
Application
haproxy
statsd server
carbon-relay
carbon-relay
carbon-cache * 4
graphite-web
carbon-cache * 4
graphite-web
carbon-cache * 4
graphite-web
haproxy
graphite-web
Option #1: Put StatsD Everywhere
• Pros
• Fixed packet loss
• Unique metric names per host
• Cons
• Unique metric names per host
• Latency only calculated per host
statsd for everyone
statsd
application
statsd
application
statsd
application
haproxy
carbon-relay
carbon-relay
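The "Latency only calculated per host" con can be illustrated with a small worked example: percentiles computed on each host cannot be averaged into a fleet-wide percentile. The sketch below uses a nearest-rank percentile and made-up latency samples; both are assumptions chosen for illustration.

```python
def percentile(values, p):
    """Nearest-rank percentile; p is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) in integer math
    return ordered[rank - 1]


# Hypothetical latency samples in milliseconds.
host_a = [10] * 100        # busy host, every request fast
host_b = [10] * 9 + [500]  # quiet host with one slow request

per_host_p99 = [percentile(host_a, 99), percentile(host_b, 99)]
average_of_p99s = sum(per_host_p99) / len(per_host_p99)
fleet_p99 = percentile(host_a + host_b, 99)

print(per_host_p99, average_of_p99s, fleet_p99)
```

Here the per-host p99 values are 10 and 500, their average is 255, yet the p99 over all 110 requests is 10. A per-host statsd can only ever report the first kind of number.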
Option #2: Sharded Statsd
• Pros
• Metric names do not need to be unique per host
• Fixed most packet loss issues for some time
• Cons
• Shard mapping lives in the client
• Some statsd servers would still have packet loss
• Shard mapping has to be kept up to date
statsd for different names
application
haproxy
carbon-relay
carbon-relay
application
application
statsd
statsd
statsd
metric.a
metric.b
metric.c
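A minimal sketch of what "shard mapping in client" could look like: the client hashes the metric name to pick a statsd server, so every host sends `metric.a` to the same place. The talk does not say how the mapping was implemented; the hash-modulo scheme and the host names below are assumptions.

```python
import hashlib

# Hypothetical shard list; in practice every client needs the same list.
STATSD_SHARDS = [
    ("statsd-0.example", 8125),
    ("statsd-1.example", 8125),
    ("statsd-2.example", 8125),
]


def shard_for(metric_name, shards=STATSD_SHARDS):
    """Pick a statsd server from a stable hash of the metric name."""
    digest = hashlib.md5(metric_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


for name in ("metric.a", "metric.b", "metric.c"):
    print(name, "->", shard_for(name))
```

With a plain modulo, adding or removing a shard moves most metric names to a different server, which is one way the "shard mapping updating" con shows up. A single very hot metric name also still lands on one server, so that server can keep dropping packets.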
Multiple Graphite Clusters
everybody gets a cluster (mid 2013)
Application (python)
Statsd Servers (python)
Graphite Cluster (Java app)
Application (java)
Statsd Servers (java)
Graphite Cluster (Python app)
User Quote
• “Graphite isn't powerful enough to handle two globs in a request, so
‘obelix.pin.prod.*.*.metrics.coll.p99’ doesn't return anything most of the time.
With just one glob it usually works, but it can be very slow.”
on querying metrics in Graphite
Monitoring* Timeline
Time Series Tools
[Timeline, 2010 to 2016. Events shown: Pinterest Launched; Ganglia for system metrics; Graphite Deployed; OpenTSDB Deployed]
*The action of observing and checking the behavior and outputs of a system and its components over time.
User Quote
• “… convinced me to try out OpenTSDB, and I am VERY GLAD they did. The
interface isn't perfect, but it does let you construct queries quickly, and the data
is all there, easy to slice by tag and *fast*. I couldn't be happier, and it has saved
me hours of frustration and confusion over the last few days while tracking down
latency issues in our search clusters.”
on using OpenTSDB
Statsd still broken
never fixed real issue
Graphs are Just Wrong
too many metrics dropped
User Quotes
• “At this point I would give just about anything for a time-series database that I
could trust. The numbers coming out of graphite from the client and server sides
don't match, and neither of them match with the ganglia numbers.”
• “I don't know which to trust; even the shapes are different, so I'm no longer
convinced that the relative changes are right. That makes it hard for me to tell if
my theories are wrong, or the numbers are wrong, or both.”
on time series metrics
Replace Statsd Server
• Local metrics-agent
• Kafka
• Storm
by adding 3 new components
Metrics-agent
• Gatekeeper for time series data
• Interface for OpenTSDB and StatsD
• different ports
• Sends metrics to Kafka
• Needed to convert to the Kafka pipeline with no downtime
• Double write to existing StatsD servers and Kafka
everybody gets an agent
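The double-write step can be sketched as an agent that fans each metric line out to every configured sink and does not let one failing path block the other. The class and sink names are hypothetical, and plain callables stand in for the real StatsD and Kafka clients so the sketch stays self-contained.

```python
class MetricsAgent:
    """Hypothetical local agent that forwards each metric line to all sinks."""

    def __init__(self, sinks):
        self.sinks = list(sinks)

    def emit(self, line):
        """Send the line to every sink; return how many sinks accepted it."""
        delivered = 0
        for sink in self.sinks:
            try:
                sink(line)
                delivered += 1
            except Exception:
                # During migration, a failure on one path must not stop the other.
                continue
        return delivered


# Stand-ins for the existing StatsD path and the new Kafka path.
legacy_statsd = []
kafka_topic = []
agent = MetricsAgent([legacy_statsd.append, kafka_topic.append])
agent.emit("api.requests:1|c")
print(legacy_statsd, kafka_topic)
```

Once the new path has been compared against the old one and is trusted, the legacy sink can be removed from the list without touching the applications.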
New Metrics Pipeline
lambda architecture (2015)
Kafka
Storm
Batch Job
metrics-agent
application
metrics-agent
application
metrics-agent
application
graphite cluster 1
graphite cluster 2
opentsdb cluster 1
opentsdb cluster 2
Fixed Graphs
no more packet loss
Current Write Throughput
• Graphite
• 120,000 points/second
• OpenTSDB
• 1.5 million points/second
Graphite and OpenTSDB
Statsboard
• Integrates Graphite, Ganglia, and OpenTSDB metrics
• Adds Graphite-like functions to OpenTSDB
• asPercent
• diffSeries
• integral
• sumSeries
• etc.
Time Series Dashboards and Alerts
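As a rough illustration of the Graphite-like functions listed above, the sketch below applies them to series that have already been fetched and aligned on the same timestamps. It is not Statsboard code. The function names mirror the Graphite ones, and the simplified handling (no gaps, equal lengths) is an assumption.

```python
def sum_series(*series):
    """Pointwise sum of aligned series."""
    return [sum(points) for points in zip(*series)]


def diff_series(first, *rest):
    """Subtract every other series from the first, pointwise."""
    if not rest:
        return list(first)
    return [a - sum(others) for a, others in zip(first, zip(*rest))]


def as_percent(series, total):
    """Each point as a percentage of the matching point in total."""
    return [100.0 * v / t if t else None for v, t in zip(series, total)]


def integral(series):
    """Running total of the series."""
    running, out = 0, []
    for value in series:
        running += value
        out.append(running)
    return out


errors = [1, 2, 0]
requests = [4, 8, 10]
print(as_percent(errors, requests), integral(requests))
```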
Statsboard Config
• Dashboards
  - "Outbound QPS Metrics":
      - title: "Outbound QPS (by %)"
        metrics:
          - stat: metric_name_1
• Alerts
  Alert Name:
    threshold: metric > 100
    pagerduty: service_name
Yet Another YAML Config Format
The Missing Element
The users
User Quotes on Graphite
• “I'm not saying Graphite isn't evil. It's evil. I'm just saying that if you spend a
week staring at it hard enough you can make some sense out of the madness :)”
• “I do not believe graphite is 'evil' since this is how RRD datasets have worked
since 1999.”
• “I don't think anyone is complaining about rrdtool, which is as much at fault for
Graphite as the Linux OS on which it runs. The problem is that you have to know
a lot of things to get correct results from a Graphite plot, and none of those
things are easy to find out (as John says, none of them appear on the data
plot).”
Graphite is Evil?
What about OpenTSDB?
I thought users were happy.
OpenTSDB Aggregation
• “Something is wrong with
OpenTSDB. My lines are often
unnaturally straight. Can you fix it?”
What exactly is getting aggregated?
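One likely source of "unnaturally straight" lines is interpolation during aggregation: when series report at different times, OpenTSDB's default aggregators fill in the missing points by linear interpolation before combining them. The sketch below reproduces that idea on two hypothetical series; it is a simplification, not OpenTSDB's implementation.

```python
def value_at(series, ts):
    """series: sorted (timestamp, value) pairs. Linear interpolation between neighbours."""
    exact = dict(series)
    if ts in exact:
        return exact[ts]
    for (t0, v0), (t1, v1) in zip(series, series[1:]):
        if t0 < ts < t1:
            return v0 + (v1 - v0) * (ts - t0) / (t1 - t0)
    return None  # outside the range covered by this series


def sum_with_interpolation(*series_list):
    """Sum series at the union of their timestamps, interpolating gaps."""
    timestamps = sorted({t for s in series_list for t, _ in s})
    result = []
    for ts in timestamps:
        values = [value_at(s, ts) for s in series_list]
        result.append((ts, sum(v for v in values if v is not None)))
    return result


sparse = [(0, 0), (120, 120)]                        # reports every 120 s
dense = [(0, 1), (30, 1), (60, 1), (90, 1), (120, 1)]  # reports every 30 s
print(sum_with_interpolation(sparse, dense))
```

The sparse series contributes 30, 60, and 90 at timestamps where it never reported, so the summed line climbs in a perfectly straight ramp between its two real points.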
Graphite User Education
• What RRDs are and how to normalize across intervals
• Metric summarization into next interval
• Getting requests/second from a timer
• Difference between stats and stats_counts
• Should I use hitcount or integral to calculate totals?
Train Users on System
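The point about summarization into the next interval can be shown with a small example: when fine-grained points are rolled up into a coarser interval, the aggregation method decides whether a count survives. The data and the `rollup` helper are hypothetical. In Graphite the method is a per-metric storage setting, and averaging is the usual default.

```python
def rollup(points, factor, method):
    """Combine every `factor` consecutive points using 'sum' or 'average'."""
    out = []
    for i in range(0, len(points), factor):
        bucket = points[i:i + factor]
        if method == "sum":
            out.append(sum(bucket))
        else:
            out.append(sum(bucket) / len(bucket))
    return out


# Hypothetical events counted in each 10-second interval over one minute.
per_10s_counts = [5, 7, 3, 0, 9, 6]

as_sum = rollup(per_10s_counts, 6, "sum")
as_average = rollup(per_10s_counts, 6, "average")
print(as_sum, as_average)
```

Thirty events happened in the minute. The summed rollup keeps that total, while the averaged rollup stores 5.0, which reads as a per-10-second rate. A user totalling the averaged series without knowing this would undercount by a factor of six.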
OpenTSDB User Education
• Getting data from continually incrementing counters
• Interpolation of data points
• How aggregation works
• Query Optimization
Train Users on System — OpenTSDB
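For the first bullet, a continually incrementing counter only becomes useful once it is converted to a rate. The sketch below computes per-second rates from raw counter samples and treats a decrease as a restart from zero. That reset rule and the sample data are assumptions for illustration, not OpenTSDB's exact behavior.

```python
def counter_rate(samples):
    """samples: (timestamp_seconds, counter_value) pairs sorted by time."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        delta = v1 - v0
        if delta < 0:
            # Counter went backwards: assume the process restarted from zero.
            delta = v1
        rates.append((t1, delta / (t1 - t0)))
    return rates


# Hypothetical request counter sampled every 10 seconds, with one restart.
samples = [(0, 100), (10, 200), (20, 50), (30, 150)]
print(counter_rate(samples))
```

Plotting the raw counter shows an ever-growing line with a cliff at the restart. The rate view shows roughly 10 requests per second throughout, with a dip only in the interval where the restart happened.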
What else have we learned?
Besides system architecture and doing user education
Protect System from Clients
• Alert on unique metrics
• Block metrics using Zookeeper
Must control incoming metrics
metrics-agent
application
opentsdb
zookeeper
counts by common prefix
Alert on Prefix Count
on-call engineer
prefix block list
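The flow in this diagram (count by common prefix, alert, then block) can be sketched as below. The prefix depth, the limit, and the in-memory block list are assumptions. In the talk the block list lives in Zookeeper, which is replaced here by a plain set so the example runs on its own.

```python
from collections import defaultdict


def prefix_of(metric_name, depth=2):
    """First `depth` dot-separated components of a metric name."""
    return ".".join(metric_name.split(".")[:depth])


def unique_metrics_by_prefix(metric_names, depth=2):
    """Number of distinct metric names seen under each prefix."""
    seen = defaultdict(set)
    for name in metric_names:
        seen[prefix_of(name, depth)].add(name)
    return {prefix: len(names) for prefix, names in seen.items()}


def prefixes_over_limit(counts, limit):
    """Prefixes whose unique-metric count should raise an alert."""
    return sorted(p for p, c in counts.items() if c > limit)


def is_blocked(metric_name, block_list):
    """True if the metric falls under any blocked prefix."""
    return any(metric_name == p or metric_name.startswith(p + ".") for p in block_list)


# Hypothetical traffic: one service embeds a user id in its metric names.
names = [f"search.latency.user_{i}" for i in range(1000)] + ["api.requests.count"]
counts = unique_metrics_by_prefix(names)
block_list = set(prefixes_over_limit(counts, limit=100))
print(counts, block_list)
```

The agent would then drop any incoming metric for which `is_blocked` is true until the on-call engineer clears the prefix.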
Trusting the Data
• Cannot control how users use the data
• Do not want business decisions made off of wrong data
• Measuring data accuracy is hard
• Count metrics generated vs. metrics written at every phase.
• A metric can get lost in many places without anyone knowing it was lost
Need to measure data points lost
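Counting generated versus written points at every phase reduces to comparing adjacent stage counters. The sketch below assumes each stage exposes a "points seen" count for the same time window. The stage names follow the pipeline slides, and the numbers are made up.

```python
def loss_report(stage_counts):
    """stage_counts: ordered (stage_name, points_seen) pairs for one time window."""
    report = []
    for (prev_name, prev), (name, current) in zip(stage_counts, stage_counts[1:]):
        lost = prev - current
        percent = 100.0 * lost / prev if prev else 0.0
        report.append((f"{prev_name} -> {name}", lost, percent))
    return report


# Hypothetical counts for one minute of traffic.
counts = [
    ("application", 1000),
    ("metrics-agent", 1000),
    ("kafka", 990),
    ("opentsdb", 980),
]
for hop, lost, percent in loss_report(counts):
    print(f"{hop}: lost {lost} points ({percent:.2f}%)")
```

An alert on any hop whose loss exceeds a small threshold points at the stage that dropped the data, rather than leaving users to notice that graphs disagree.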
Lessen Aggregator Overhead
• StatsD performs a network call to update a metric
• Manually tune sample rate to lessen overhead (time consuming)
• Java uses the Ostrich library for in-process aggregation
Ideally In Process
metrics-agent
Java Application
Ostrich
metrics-agent
StatsD Client
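The difference between the two paths in this diagram is where the counting happens. A minimal sketch of in-process aggregation, with a hypothetical `LocalAggregator` class and a callable standing in for the send to metrics-agent:

```python
from collections import Counter


class LocalAggregator:
    """Hypothetical in-process counter store that flushes in one batch."""

    def __init__(self, flush_fn):
        self._counts = Counter()
        self._flush_fn = flush_fn

    def incr(self, name, amount=1):
        # A dictionary update only; no network call on the request path.
        self._counts[name] += amount

    def flush(self):
        """Hand the accumulated counts to flush_fn and start a fresh window."""
        snapshot, self._counts = dict(self._counts), Counter()
        if snapshot:
            self._flush_fn(snapshot)
        return snapshot


sent = []  # stand-in for the single send to metrics-agent
aggregator = LocalAggregator(sent.append)
for _ in range(1000):
    aggregator.incr("api.requests")
aggregator.flush()
print(sent)
```

A thousand increments produce one outbound message instead of a thousand, and no sample rate needs tuning. This sketch is not thread-safe; a real implementation would need locking or per-thread counters, and a timer to call `flush`.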
Lessen Operational Overhead
• More tools, more overhead
• Adding boxes to Graphite is hard
• Adding boxes to OpenTSDB is easy
• More monitoring systems, more monitoring of the monitoring system
• Removing a tool in production is hard
• Ganglia, Graphite, and OpenTSDB all still running
• As the product gets more 9s, so must the monitoring tools.
Fewer Tools?
Set User Expectations
• Data has a lifetime
• Unless otherwise conveyed, most users expect data to exist indefinitely.
• Not magical data warehouse tools that return data instantly
• Not all metrics will be efficient
I didn’t expect this talk to go on so long
Summary
• Match the monitoring system to where the company is at
• User education is key to scale these tools organizationally
• Tools scale with the number of engineers, not the number of users of the site
Thanks for listening
40

Weitere ähnliche Inhalte

Was ist angesagt?

Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015Institute e-Austria Timisoara
 
The devops approach to monitoring, Open Source and Infrastructure as Code Style
The devops approach to monitoring, Open Source and Infrastructure as Code StyleThe devops approach to monitoring, Open Source and Infrastructure as Code Style
The devops approach to monitoring, Open Source and Infrastructure as Code StyleJulien Pivotto
 
Taking AppSec to 11 - BSides Austin 2016
Taking AppSec to 11 - BSides Austin 2016Taking AppSec to 11 - BSides Austin 2016
Taking AppSec to 11 - BSides Austin 2016Matt Tesauro
 
How to Monitoring the SRE Golden Signals (E-Book)
How to Monitoring the SRE Golden Signals (E-Book)How to Monitoring the SRE Golden Signals (E-Book)
How to Monitoring the SRE Golden Signals (E-Book)Siglos
 
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-AriThinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-AriDemi Ben-Ari
 
Making Runtime Data Useful for Incident Diagnosis: An Experience Report
Making Runtime Data Useful for Incident Diagnosis: An Experience ReportMaking Runtime Data Useful for Incident Diagnosis: An Experience Report
Making Runtime Data Useful for Incident Diagnosis: An Experience ReportQAware GmbH
 
Overview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics FrameworksOverview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics FrameworksSlim Baltagi
 
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...Ververica
 
Time Series Anomaly Detection with .net and Azure
Time Series Anomaly Detection with .net and AzureTime Series Anomaly Detection with .net and Azure
Time Series Anomaly Detection with .net and AzureMarco Parenzan
 
Converging Big Data and Application Infrastructure by Steven Poutsy
Converging Big Data and Application Infrastructure by Steven PoutsyConverging Big Data and Application Infrastructure by Steven Poutsy
Converging Big Data and Application Infrastructure by Steven PoutsyBig Data Spain
 
Apache Flink community Update for March 2016 - Slim Baltagi
Apache Flink community Update for March 2016 - Slim BaltagiApache Flink community Update for March 2016 - Slim Baltagi
Apache Flink community Update for March 2016 - Slim BaltagiSlim Baltagi
 
Elephants in the cloud or how to become cloud ready
Elephants in the cloud or how to become cloud readyElephants in the cloud or how to become cloud ready
Elephants in the cloud or how to become cloud readyKrzysztof Adamski
 
Solving the Hidden Costs of Kubernetes with Observability
Solving the Hidden Costs of Kubernetes with ObservabilitySolving the Hidden Costs of Kubernetes with Observability
Solving the Hidden Costs of Kubernetes with ObservabilityDevOps.com
 
WJAX 2019 - Taking Distributed Tracing to the next level
WJAX 2019 - Taking Distributed Tracing to the next levelWJAX 2019 - Taking Distributed Tracing to the next level
WJAX 2019 - Taking Distributed Tracing to the next levelFrank Pfleger
 
How to Develop and Simulate Models with No Coding Experience
How to Develop and Simulate Models with No Coding ExperienceHow to Develop and Simulate Models with No Coding Experience
How to Develop and Simulate Models with No Coding ExperienceElizabeth Steiner
 
How Do We Better Sell DevOps? - PuppetConf 2013
How Do We Better Sell DevOps? - PuppetConf 2013How Do We Better Sell DevOps? - PuppetConf 2013
How Do We Better Sell DevOps? - PuppetConf 2013Puppet
 
AWS Loft Talk: Behind the Scenes with SignalFx
AWS Loft Talk: Behind the Scenes with SignalFxAWS Loft Talk: Behind the Scenes with SignalFx
AWS Loft Talk: Behind the Scenes with SignalFxSignalFx
 
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...rschuppe
 
T-Mobile and Elastic
T-Mobile and ElasticT-Mobile and Elastic
T-Mobile and ElasticElasticsearch
 
STORM as an ETL Engine to HADOOP
STORM as an ETL Engine to HADOOPSTORM as an ETL Engine to HADOOP
STORM as an ETL Engine to HADOOPDataWorks Summit
 

Was ist angesagt? (20)

Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015
 
The devops approach to monitoring, Open Source and Infrastructure as Code Style
The devops approach to monitoring, Open Source and Infrastructure as Code StyleThe devops approach to monitoring, Open Source and Infrastructure as Code Style
The devops approach to monitoring, Open Source and Infrastructure as Code Style
 
Taking AppSec to 11 - BSides Austin 2016
Taking AppSec to 11 - BSides Austin 2016Taking AppSec to 11 - BSides Austin 2016
Taking AppSec to 11 - BSides Austin 2016
 
How to Monitoring the SRE Golden Signals (E-Book)
How to Monitoring the SRE Golden Signals (E-Book)How to Monitoring the SRE Golden Signals (E-Book)
How to Monitoring the SRE Golden Signals (E-Book)
 
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-AriThinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
 
Making Runtime Data Useful for Incident Diagnosis: An Experience Report
Making Runtime Data Useful for Incident Diagnosis: An Experience ReportMaking Runtime Data Useful for Incident Diagnosis: An Experience Report
Making Runtime Data Useful for Incident Diagnosis: An Experience Report
 
Overview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics FrameworksOverview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics Frameworks
 
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...
 
Time Series Anomaly Detection with .net and Azure
Time Series Anomaly Detection with .net and AzureTime Series Anomaly Detection with .net and Azure
Time Series Anomaly Detection with .net and Azure
 
Converging Big Data and Application Infrastructure by Steven Poutsy
Converging Big Data and Application Infrastructure by Steven PoutsyConverging Big Data and Application Infrastructure by Steven Poutsy
Converging Big Data and Application Infrastructure by Steven Poutsy
 
Apache Flink community Update for March 2016 - Slim Baltagi
Apache Flink community Update for March 2016 - Slim BaltagiApache Flink community Update for March 2016 - Slim Baltagi
Apache Flink community Update for March 2016 - Slim Baltagi
 
Elephants in the cloud or how to become cloud ready
Elephants in the cloud or how to become cloud readyElephants in the cloud or how to become cloud ready
Elephants in the cloud or how to become cloud ready
 
Solving the Hidden Costs of Kubernetes with Observability
Solving the Hidden Costs of Kubernetes with ObservabilitySolving the Hidden Costs of Kubernetes with Observability
Solving the Hidden Costs of Kubernetes with Observability
 
WJAX 2019 - Taking Distributed Tracing to the next level
WJAX 2019 - Taking Distributed Tracing to the next levelWJAX 2019 - Taking Distributed Tracing to the next level
WJAX 2019 - Taking Distributed Tracing to the next level
 
How to Develop and Simulate Models with No Coding Experience
How to Develop and Simulate Models with No Coding ExperienceHow to Develop and Simulate Models with No Coding Experience
How to Develop and Simulate Models with No Coding Experience
 
How Do We Better Sell DevOps? - PuppetConf 2013
How Do We Better Sell DevOps? - PuppetConf 2013How Do We Better Sell DevOps? - PuppetConf 2013
How Do We Better Sell DevOps? - PuppetConf 2013
 
AWS Loft Talk: Behind the Scenes with SignalFx
AWS Loft Talk: Behind the Scenes with SignalFxAWS Loft Talk: Behind the Scenes with SignalFx
AWS Loft Talk: Behind the Scenes with SignalFx
 
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...
Application Performance Troubleshooting 1x1 - Part 2 - Noch mehr Schweine und...
 
T-Mobile and Elastic
T-Mobile and ElasticT-Mobile and Elastic
T-Mobile and Elastic
 
STORM as an ETL Engine to HADOOP
STORM as an ETL Engine to HADOOPSTORM as an ETL Engine to HADOOP
STORM as an ETL Engine to HADOOP
 

Andere mochten auch

Monitoring Challenges - Monitorama 2016 - Monitoringless
Monitoring Challenges - Monitorama 2016 - MonitoringlessMonitoring Challenges - Monitorama 2016 - Monitoringless
Monitoring Challenges - Monitorama 2016 - MonitoringlessAdrian Cockcroft
 
Prometheus (Monitorama 2016)
Prometheus (Monitorama 2016)Prometheus (Monitorama 2016)
Prometheus (Monitorama 2016)Brian Brazil
 
Production testing through monitoring
Production testing through monitoringProduction testing through monitoring
Production testing through monitoringLeon Fayer
 
SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsBrendan Gregg
 
Monitoring As a Service
Monitoring As a ServiceMonitoring As a Service
Monitoring As a ServiceJames Turnbull
 
DevOpsDays Amsterdam - Monitoring at Service Provider Scale
DevOpsDays Amsterdam - Monitoring at Service Provider ScaleDevOpsDays Amsterdam - Monitoring at Service Provider Scale
DevOpsDays Amsterdam - Monitoring at Service Provider ScaleChris Jackson
 
Bruce Lawson: Progressive Web Apps: the future of Apps
Bruce Lawson: Progressive Web Apps: the future of AppsBruce Lawson: Progressive Web Apps: the future of Apps
Bruce Lawson: Progressive Web Apps: the future of Appsbrucelawson
 
Microservices Workshop All Topics Deck 2016
Microservices Workshop All Topics Deck 2016Microservices Workshop All Topics Deck 2016
Microservices Workshop All Topics Deck 2016Adrian Cockcroft
 
A Billion Points of Data Pressure
A Billion Points of Data PressureA Billion Points of Data Pressure
A Billion Points of Data PressureBecky Mendenhall
 
StatsD Workshop Monitorama 2013
StatsD Workshop Monitorama 2013StatsD Workshop Monitorama 2013
StatsD Workshop Monitorama 2013Daniel Schauenberg
 
Statsd backends presentation
Statsd backends presentationStatsd backends presentation
Statsd backends presentationDraco2002
 
Velocity building a performance lab for mobile apps in a day - final
Velocity   building a performance lab for mobile apps in a day - finalVelocity   building a performance lab for mobile apps in a day - final
Velocity building a performance lab for mobile apps in a day - finalAshray Mathur
 
Cloud Monitoring with Prometheus
Cloud Monitoring with PrometheusCloud Monitoring with Prometheus
Cloud Monitoring with PrometheusQAware GmbH
 
Continuous Delivery: Making DevOps Awesome
Continuous Delivery: Making DevOps AwesomeContinuous Delivery: Making DevOps Awesome
Continuous Delivery: Making DevOps AwesomeNicole Forsgren
 
Update on OpenTSDB and AsyncHBase
Update on OpenTSDB and AsyncHBase Update on OpenTSDB and AsyncHBase
Update on OpenTSDB and AsyncHBase HBaseCon
 
Scalable Architectures - Taming the Twitter Firehose
Scalable Architectures - Taming the Twitter FirehoseScalable Architectures - Taming the Twitter Firehose
Scalable Architectures - Taming the Twitter FirehoseLorenzo Alberton
 
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUponHBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUponCloudera, Inc.
 

Andere mochten auch (20)

Monitoring Challenges - Monitorama 2016 - Monitoringless
Monitoring Challenges - Monitorama 2016 - MonitoringlessMonitoring Challenges - Monitorama 2016 - Monitoringless
Monitoring Challenges - Monitorama 2016 - Monitoringless
 
2016 metrics-as-culture
2016 metrics-as-culture2016 metrics-as-culture
2016 metrics-as-culture
 
Statistics for Engineers
Statistics for EngineersStatistics for Engineers
Statistics for Engineers
 
Prometheus (Monitorama 2016)
Prometheus (Monitorama 2016)Prometheus (Monitorama 2016)
Prometheus (Monitorama 2016)
 
Production testing through monitoring
Production testing through monitoringProduction testing through monitoring
Production testing through monitoring
 
SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREs
 
Monitoring As a Service
Monitoring As a ServiceMonitoring As a Service
Monitoring As a Service
 
DevOpsDays Amsterdam - Monitoring at Service Provider Scale
DevOpsDays Amsterdam - Monitoring at Service Provider ScaleDevOpsDays Amsterdam - Monitoring at Service Provider Scale
DevOpsDays Amsterdam - Monitoring at Service Provider Scale
 
Bruce Lawson: Progressive Web Apps: the future of Apps
Bruce Lawson: Progressive Web Apps: the future of AppsBruce Lawson: Progressive Web Apps: the future of Apps
Bruce Lawson: Progressive Web Apps: the future of Apps
 
Microservices Workshop All Topics Deck 2016
Microservices Workshop All Topics Deck 2016Microservices Workshop All Topics Deck 2016
Microservices Workshop All Topics Deck 2016
 
A Billion Points of Data Pressure
A Billion Points of Data PressureA Billion Points of Data Pressure
A Billion Points of Data Pressure
 
StatsD Workshop Monitorama 2013
StatsD Workshop Monitorama 2013StatsD Workshop Monitorama 2013
StatsD Workshop Monitorama 2013
 
Statsd backends presentation
Statsd backends presentationStatsd backends presentation
Statsd backends presentation
 
Velocity building a performance lab for mobile apps in a day - final
Velocity   building a performance lab for mobile apps in a day - finalVelocity   building a performance lab for mobile apps in a day - final
Velocity building a performance lab for mobile apps in a day - final
 
How to Speak "Manager"
How to Speak "Manager"How to Speak "Manager"
How to Speak "Manager"
 
Cloud Monitoring with Prometheus
Cloud Monitoring with PrometheusCloud Monitoring with Prometheus
Cloud Monitoring with Prometheus
 
Continuous Delivery: Making DevOps Awesome
Continuous Delivery: Making DevOps AwesomeContinuous Delivery: Making DevOps Awesome
Continuous Delivery: Making DevOps Awesome
 
Update on OpenTSDB and AsyncHBase
Update on OpenTSDB and AsyncHBase Update on OpenTSDB and AsyncHBase
Update on OpenTSDB and AsyncHBase
 
Scalable Architectures - Taming the Twitter Firehose
Scalable Architectures - Taming the Twitter FirehoseScalable Architectures - Taming the Twitter Firehose
Scalable Architectures - Taming the Twitter Firehose
 
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUponHBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
 

Ähnlich wie Scaling Pinterest's Monitoring Systems

Making operations visible - Nick Gallbreath
Making operations visible - Nick GallbreathMaking operations visible - Nick Gallbreath
Making operations visible - Nick GallbreathDevopsdays
 
Making operations visible - devopsdays tokyo 2013
Making operations visible  - devopsdays tokyo 2013Making operations visible  - devopsdays tokyo 2013
Making operations visible - devopsdays tokyo 2013Nick Galbreath
 
Monitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at DatabricksMonitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at DatabricksAnyscale
 
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Codemotion
 
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Demi Ben-Ari
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for ExperimentationGleb Kanterov
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadatamarkgrover
 
Industry Keynote at Large Scale Testing Workshop 2015
Industry Keynote at Large Scale Testing Workshop 2015Industry Keynote at Large Scale Testing Workshop 2015
Industry Keynote at Large Scale Testing Workshop 2015Wolfgang Gottesheim
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Demi Ben-Ari
 
Store, Extract, Transform, Load, Visualize. Untagged Conference
Store, Extract, Transform, Load, Visualize. Untagged ConferenceStore, Extract, Transform, Load, Visualize. Untagged Conference
Store, Extract, Transform, Load, Visualize. Untagged ConferenceAni Lopez
 
8th TUC Meeting - Tim Hegeman (TU Delft). Social Network Benchmark, Analytics...
8th TUC Meeting - Tim Hegeman (TU Delft). Social Network Benchmark, Analytics...8th TUC Meeting - Tim Hegeman (TU Delft). Social Network Benchmark, Analytics...
8th TUC Meeting - Tim Hegeman (TU Delft). Social Network Benchmark, Analytics...LDBC council
 
Reproducible data science: review of Pachyderm, Data Version Control and GIT ...
Reproducible data science: review of Pachyderm, Data Version Control and GIT ...Reproducible data science: review of Pachyderm, Data Version Control and GIT ...
Reproducible data science: review of Pachyderm, Data Version Control and GIT ...Josh Levy-Kramer
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Demi Ben-Ari
 
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...Codemotion
 
Metrics-Based Process Mapping
Metrics-Based Process MappingMetrics-Based Process Mapping
Metrics-Based Process MappingTKMG, Inc.
 
linkTuner Webinar - March 2013
linkTuner Webinar - March 2013linkTuner Webinar - March 2013
linkTuner Webinar - March 2013Fishbowl Solutions
 
Managing Performance Globally with MySQL
Managing Performance Globally with MySQLManaging Performance Globally with MySQL
Managing Performance Globally with MySQLDaniel Austin
 
StasD & Graphite - Measure anything, Measure Everything
StasD & Graphite - Measure anything, Measure EverythingStasD & Graphite - Measure anything, Measure Everything
StasD & Graphite - Measure anything, Measure EverythingAvi Revivo
 

Ähnlich wie Scaling Pinterest's Monitoring Systems (20)

Making operations visible - Nick Gallbreath
Making operations visible - Nick GallbreathMaking operations visible - Nick Gallbreath
Making operations visible - Nick Gallbreath
 
Making operations visible - devopsdays tokyo 2013
Making operations visible  - devopsdays tokyo 2013Making operations visible  - devopsdays tokyo 2013
Making operations visible - devopsdays tokyo 2013
 
Monitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at DatabricksMonitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at Databricks
 
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...



Scaling Pinterest's Monitoring Systems

  • 1. Scaling Pinterest’s Monitoring 1 Brian Overstreet - Visibility Software Engineer
  • 2. Monitorama Agenda • What is Pinterest? • Starting from Scratch • Scaling the Monitoring System • Focused on time series metrics • Challenges faced • The Missing Element • Lessons Learned • Summary Scaling Pinterest’s Monitoring 2
  • 3. 75+ Billion Ideas categorized by people into more than 1 Billion Boards 3
  • 4. [Chart] Pinterest Unique Visitors (millions), y-axis 0 to 40, x-axis Jan 2011 to Jan 2013. Source: comScore 4
  • 5. Tools • Ganglia (system metrics) • No application metrics • Up/Down Checks Early 2012 5
  • 6. From Bad to Worse Lots of Outages 6
  • 7. Monitoring* Timeline Time Series Tools 7 Timeline axis: 2010 2011 2012 2013 2014 2015 2016. Events: Pinterest Launched; Ganglia for system metrics; Graphite Deployed. *The action of observing and checking the behavior and outputs of a system and its components over time.
  • 8. First Graphite Architecture Single Box — Early 2012 8 Application graphite-web carbon-cache statsd-server Metrics Box statsd UDP protocol
  • 9. First Graphite Architecture Single Box — Early 2012 9 Application carbon-cache statsd-server Metrics Box statsd UDP protocol graphite-web
  • 10. Second Graphite Architecture Clustered - Early 2013 10 Application haproxy statsd server carbon-relay carbon-relay carbon-cache * 4 graphite-web carbon-cache * 4 graphite-web carbon-cache * 4 graphite-web haproxy graphite-web
  • 11. Second Graphite Architecture Clustered - Early 2013 11 Application haproxy statsd server carbon-relay carbon-relay carbon-cache * 4 graphite-web carbon-cache * 4 graphite-web carbon-cache * 4 graphite-web haproxy graphite-web
  • 12. Option #1: Put StatsD Everywhere • Pros • Fixed packet loss • Unique metric names per host • Cons • Unique metric names per host • Latency only calculated per host statsd for everyone 12 statsd application statsd application statsd application haproxy carbon-relay carbon-relay
  • 13. Option #2: Sharded Statsd • Pros • Metric names do not need to be unique per host • Fixed most packet loss issues for some time • Cons • Shard mapping in client • Some statsd servers would still have packet loss • Updating the shard mapping statsd for different names 13 application haproxy carbon-relay carbon-relay application application statsd statsd statsd metric.a metric.b metric.c
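
The "shard mapping in client" from slide 13 can be pictured with a small sketch. This is only an illustration under stated assumptions: the slides do not say how Pinterest mapped metric names to statsd servers, and the host names, port, and the choice of crc32 below are all hypothetical.

```python
import zlib

# Hypothetical shard list; the slides do not name hosts or ports.
STATSD_SHARDS = [
    ("statsd-0.example.internal", 8125),
    ("statsd-1.example.internal", 8125),
    ("statsd-2.example.internal", 8125),
]


def shard_for(metric_name, shards=STATSD_SHARDS):
    """Pick the statsd server that owns a metric name.

    crc32 is stable across processes and hosts (unlike Python's built-in
    hash() for strings), so every client sends a given name to the same
    server and that server can aggregate it across all hosts.
    """
    index = zlib.crc32(metric_name.encode("utf-8")) % len(shards)
    return shards[index]
```

With a plain modulo like this, adding or removing a shard moves most metric names to a different server, which is one way to read the "updating the shard mapping" con on the slide.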
  • 14. Multiple Graphite Clusters everybody gets a cluster (mid 2013) 14 Application (python) Statsd Servers (python) Graphite Cluster (Java app) Application (java) Statsd Servers (java) Graphite Cluster (Python app)
  • 15. User Quote • “Graphite isn't powerful enough to handle two globs in a request, so ‘obelix.pin.prod.*.*.metrics.coll.p99’ doesn't return anything most of the time. With just one glob it usually works, but it can be very slow.” on querying metrics in Graphite 15
  • 16. Monitoring* Timeline Time Series Tools 16 Timeline axis: 2010 2011 2012 2013 2014 2015 2016. Events: Pinterest Launched; Ganglia for system metrics; Graphite Deployed; OpenTSDB Deployed. *The action of observing and checking the behavior and outputs of a system and its components over time.
  • 17. User Quote • “… convinced me to try out OpenTSDB, and I am VERY GLAD they did. The interface isn't perfect, but it does let you construct queries quickly, and the data is all there, easy to slice by tag and *fast*. I couldn't be happier, and it has saved me hours of frustration and confusion over the last few days while tracking down latency issues in our search clusters.” on using OpenTSDB 17
  • 18. Statsd still broken never fixed real issue 18
  • 19. Graphs are Just Wrong too many metrics dropped 19
  • 20. User Quotes • “At this point I would give just about anything for a time-series database that I could trust. The numbers coming out of graphite from the client and server sides don't match, and neither of them match with the ganglia numbers.” • “I don't know which to trust; even the shapes are different, so I'm no longer convinced that the relative changes are right. That makes it hard for me to tell if my theories are wrong, or the numbers are wrong, or both.” on time series metrics 20
  • 21. Replace Statsd Server • Local metrics-agent • Kafka • Storm by adding 3 new components 21
  • 22. Metrics-agent • Gatekeeper for time series data • Interface for OpenTSDB and StatsD • different ports • Sends metrics to Kafka • Needed to convert to Kafka pipeline with no downtime • Double write to existing StatsD servers and Kafka everybody gets an agent 22
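
The double-write step on slide 22 can be sketched as an agent that forwards each point to every configured sink. This is a minimal sketch, not the actual metrics-agent: the class name, the `(name, send)` sink shape, and the error counting are assumptions, and the sinks are plain callables standing in for a StatsD sender and a Kafka producer.

```python
class MetricsAgent:
    """Forwards each data point to every sink (hypothetical design).

    During a migration the sinks would be the existing StatsD servers and
    the new Kafka pipeline; here they are plain callables so the sketch
    stays self-contained.
    """

    def __init__(self, sinks):
        self.sinks = list(sinks)  # list of (sink_name, send_callable)
        self.errors = {}          # sink_name -> number of failed sends

    def submit(self, name, value, timestamp):
        point = (name, value, timestamp)
        delivered = 0
        for sink_name, send in self.sinks:
            try:
                send(point)
                delivered += 1
            except Exception:
                # One failing sink must not block the other during cutover.
                self.errors[sink_name] = self.errors.get(sink_name, 0) + 1
        return delivered
```

Once both paths show the same data, the old sink can be dropped from the list without touching application code, which is the point of putting an agent in front of the pipeline.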
  • 23. New Metrics Pipeline lambda architecture (2015) 23 Kafka Storm Batch Job metrics-agent application metrics-agent application metrics-agent application graphite cluster 1 graphite cluster 2 opentsdb cluster 1 opentsdb cluster 2
  • 24. Fixed Graphs no more packet loss 24
  • 25. Current Write Throughput • Graphite • 120,000 points/second • OpenTSDB • 1.5 million points/second Graphite and OpenTSDB 25
  • 26. Statsboard • Integrates Graphite, Ganglia, OpenTSDB metrics • Adds Graphite like functions to OpenTSDB • asPercent • diffSeries • integral • sumSeries • etc. Time Series Dashboards and Alerts 26
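
To make the Graphite-like functions on slide 26 concrete, here is a rough sketch of what they compute. It is not Statsboard's code: the Python names are made up, and it assumes every series is a list of numbers already aligned on the same timestamps with no gaps.

```python
def sum_series(*series):
    """Pointwise sum of several aligned series (like sumSeries)."""
    return [sum(points) for points in zip(*series)]


def diff_series(first, *others):
    """Subtract every other series from the first, pointwise (like diffSeries)."""
    return [a - sum(rest) for a, *rest in zip(first, *others)]


def integral(series):
    """Running total of a series over time (like integral)."""
    total = 0
    out = []
    for value in series:
        total += value
        out.append(total)
    return out


def as_percent(series, total):
    """Each point as a percentage of the matching point in total (like asPercent)."""
    return [100.0 * v / t if t else None for v, t in zip(series, total)]
```

For example, `as_percent(errors, sum_series(errors, successes))` would give an error percentage per interval, given two aligned series `errors` and `successes`.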
  • 27. Statsboard Config Yet Another YAML Config Format 27
      • Dashboards
          - "Outbound QPS Metrics":
              - title: "Outbound QPS (by %)"
                metrics:
                  - stat: metric_name_1
      • Alerts
          Alert Name:
            threshold: metric > 100
            pagerduty: service_name
  • 29. User Quotes on Graphite • “I'm not saying Graphite isn't evil. It's evil. I'm just saying that if you spend a week staring at it hard enough you can make some sense out of the madness :)” • “I do not believe graphite is 'evil' since this is how RRD datasets have worked since 1999.” • “I don't think anyone is complaining about rrdtool, which is as much at fault for Graphite as the Linux OS on which it runs. The problem is that you have to know a lot of things to get correct results from a Graphite plot, and none of those things are easy to find out (as John says, none of them appear on the data plot).” Graphite is Evil? 29
  • 30. What about OpenTSDB? I thought users were happy. 30
  • 31. OpenTSDB Aggregation • “Something is wrong with OpenTSDB. My lines are often unnaturally straight. Can you fix it?” What exactly is getting aggregated? 31
  • 32. Graphite User Education • What RRDs are and how to normalize across intervals • Metric summarization into next interval • Getting requests/second from a timer • Difference between stats and stats_counts • Should I use hitcount or integral to calculate totals? Train Users on System 32
  • 33. OpenTSDB User Education • Getting data from continually incrementing counters • Interpolation of data points • How aggregation works • Query Optimization Train Users on System — OpenTSDB 33
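
Two of the education topics on slide 33 are easier to see in code. The sketch below shows one way to turn a continually incrementing counter into a per-second rate, and how linear interpolation fills in a missing point. Interpolating across sparse points is also a plausible source of the "unnaturally straight" lines on slide 31. The function names and the counter-reset rule are assumptions for illustration, not OpenTSDB's implementation.

```python
def counter_rate(samples):
    """Per-second rate from (timestamp_seconds, counter_value) samples.

    Samples must be sorted by time. A drop in the counter is treated as a
    process restart, so the new value counts as growth since the reset
    (an assumed policy, chosen here for simplicity).
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t1 <= t0:
            continue  # skip duplicate or out-of-order timestamps
        delta = v1 - v0
        if delta < 0:
            delta = v1
        rates.append((t1, delta / (t1 - t0)))
    return rates


def interpolate(p0, p1, t):
    """Linearly interpolate a value at time t between two known points."""
    (t0, v0), (t1, v1) = p0, p1
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
```

Plotting the raw counter instead of its rate gives an ever-rising line, which is why the first bullet on the slide needs explaining to users.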
  • 34. What else have we learned? Besides system architecture and doing user education 34
  • 35. Protect System from Clients • Alert on unique metrics • Block metrics using Zookeeper Must control incoming metrics 35 metrics-agent application opentsdb zookeeper counts by common prefix Alert on Prefix Count on-call engineer prefix block list
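
The flow on slide 35 (count unique metrics by common prefix, alert, then block) might look roughly like this. It is a sketch under assumptions: the class and method names are invented, the prefix depth of two name segments is arbitrary, and an in-memory set stands in for the block list that the slide keeps in Zookeeper.

```python
from collections import defaultdict


def prefix_of(metric_name, depth=2):
    """First `depth` dot-separated segments of a metric name."""
    return ".".join(metric_name.split(".")[:depth])


class PrefixGuard:
    """Tracks unique metric names per prefix and enforces a block list."""

    def __init__(self, limit, depth=2):
        self.limit = limit              # unique names allowed per prefix
        self.depth = depth
        self.seen = defaultdict(set)    # prefix -> unique metric names
        self.blocked = set()            # stand-in for the Zookeeper block list

    def accept(self, metric_name):
        """Return False for metrics under a blocked prefix, else record them."""
        prefix = prefix_of(metric_name, self.depth)
        if prefix in self.blocked:
            return False
        self.seen[prefix].add(metric_name)
        return True

    def over_limit(self):
        """Prefixes whose unique-name count should alert the on-call engineer."""
        return sorted(p for p, names in self.seen.items() if len(names) > self.limit)
```

In this sketch the on-call engineer reacts to `over_limit()` by adding the offending prefix to `blocked`, after which `accept()` rejects those metrics while other prefixes keep flowing.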
  • 36. Trusting the Data • Cannot control how users use the data • Do not want business decisions made off of wrong data • Measuring data accuracy is hard • Count metrics generated vs. metrics written at every phase. • Lots of places a metric can get lost without anyone knowing it was lost Need to measure data points lost 36
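
Counting "metrics generated vs. metrics written at every phase" boils down to comparing point counts between adjacent stages. A minimal sketch, assuming each stage can report how many points it saw in the same time window (the stage names in the usage note are made up):

```python
def stage_loss(counts):
    """Points lost between consecutive pipeline stages.

    counts is an ordered list of (stage_name, points_seen). Returns one
    (hop, lost, fraction_lost) tuple per pair of adjacent stages.
    """
    report = []
    for (prev_name, prev), (name, cur) in zip(counts, counts[1:]):
        lost = prev - cur
        fraction = lost / prev if prev else 0.0
        report.append((f"{prev_name} -> {name}", lost, fraction))
    return report
```

Calling `stage_loss([("generated", 1000), ("agent", 1000), ("kafka", 990)])` reports no loss on the first hop and 10 points (1%) lost on the second, which localizes where a metric disappeared instead of only showing that the final graph looks low.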
  • 37. Lessen Aggregator Overhead • StatsD performs network call to update a metric • Manually tune sample rate to lessen overhead (time consuming) • Java uses Ostrich library for in process aggregation Ideally In Process 37 metrics-agent Java Application Ostrich metrics-agent StatsD Client
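
The contrast on slide 37 is between one network call per metric update and aggregating inside the process, then sending a summary. The sketch below illustrates the in-process side for counters only. The class is hypothetical (it is not Ostrich or the metrics-agent), it is not thread-safe, and it emits lines in the StatsD counter format `name:value|c`.

```python
from collections import Counter


class InProcessCounters:
    """Accumulates counter increments in memory and flushes them in bulk."""

    def __init__(self):
        self._counts = Counter()

    def incr(self, name, amount=1):
        # No network call here: just an in-memory addition.
        self._counts[name] += amount

    def flush(self):
        """Return one StatsD counter line per metric and reset the totals."""
        lines = [f"{name}:{count}|c" for name, count in sorted(self._counts.items())]
        self._counts.clear()
        return lines
```

A timer would call `flush()` every few seconds and hand the lines to the local agent, so a hot code path that increments a counter thousands of times per second produces one line per flush instead of thousands of packets, with no sample rate to tune by hand.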
  • 38. Lessen Operational Overhead • More tools, more overhead • Adding boxes to Graphite is hard • Adding boxes to OpenTSDB is easy • More monitoring systems, more monitoring of the monitoring system • Removing a tool in production is hard • Ganglia, Graphite, and OpenTSDB all still running • As the product gets more 9s, so must the monitoring tools. Fewer Tools? 38
  • 39. Set User Expectations • Data has a lifetime • Unless otherwise conveyed, most users expect data to exist indefinitely. • Not magical data warehouse tools that return data instantly • Not all metrics will be efficient I didn’t expect this talk to go on so long 39
  • 40. Summary • Match the monitoring system to where the company is at • User education is key to scale these tools organizationally • Tools scale with number of engineers not users of site Thanks for listening 40