A monitoring system is arguably the most crucial system to have in place when administering and tuning the performance of any database system. DBAs also find themselves with a variety of monitoring systems and plugins to choose from, ranging from small scripts in cron to complex data collection systems. In this talk, I’ll discuss how Box shifted from the Cacti monitoring system and various shell scripts to OpenTSDB, and the changes we made to our servers and to our daily interaction with monitoring to increase our agility in identifying and addressing changes in database behavior.
10. OpenTSDB is...
• Distributed
• Scalable
• Time Series Database
• Runs on HBase
• Created by Benoit Sigoure
[Architecture diagram. Labels: mydb.example.com, fe1.example.com, HAProxy, TSD for Storing, TSD for Querying, HBase. Arrows: Push Metrics, Query via API.]
11. Why OpenTSDB?
• FAST
• EASY to Scale
• EASY to Populate
• EASY to collect data
• EASY to Query
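To illustrate the "EASY to Populate" point: OpenTSDB accepts data points as plain-text `put` lines sent to a TSD. The following is a minimal sketch, not Box's collector. The metric name `mysql.threads_connected`, the host `tsd.example.com`, and the helper names are assumptions made for illustration; port 4242 is the TSD default.

```python
import socket
import time


def format_put(metric, value, tags, timestamp=None):
    """Build one line of OpenTSDB's text 'put' command:
    put <metric> <unix-timestamp> <value> <tag=value> [<tag=value> ...]
    """
    if not tags:
        raise ValueError("OpenTSDB requires at least one tag per data point")
    if timestamp is None:
        timestamp = int(time.time())
    tag_str = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {int(timestamp)} {value} {tag_str}"


def send_lines(lines, host="tsd.example.com", port=4242):
    """Send put lines to a TSD over TCP (hypothetical host; not called here)."""
    payload = "".join(line + "\n" for line in lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


# Hypothetical metric name and host tag, fixed timestamp for a stable example.
line = format_put(
    "mysql.threads_connected", 42,
    {"host": "mydb.example.com"}, timestamp=1365465600,
)
print(line)
```

A collector would typically batch many such lines per interval and hand them to something like `send_lines`.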
31. Table Info from I_S
SELECT *, DATA_LENGTH+INDEX_LENGTH AS TOTAL_LENGTH
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA NOT IN
('performance_schema','information_schema');
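As a rough sketch of how rows from a query like this could become time-series data points, the function below maps each row to one point per size column, tagged by host, schema, and table. The metric names (`mysql.table.*`) and the tag keys are assumptions for illustration, not the names Box uses; the sample rows are made up.

```python
# Column in the result set -> hypothetical metric name.
COLUMN_TO_METRIC = (
    ("DATA_LENGTH", "mysql.table.data_length"),
    ("INDEX_LENGTH", "mysql.table.index_length"),
    ("TOTAL_LENGTH", "mysql.table.total_length"),
    ("TABLE_ROWS", "mysql.table.rows"),
)


def table_rows_to_points(rows, host, timestamp):
    """Turn INFORMATION_SCHEMA.TABLES rows (as dicts) into
    (metric, timestamp, value, tags) tuples."""
    points = []
    for row in rows:
        tags = {
            "host": host,
            "schema": row["TABLE_SCHEMA"],
            "table": row["TABLE_NAME"],
        }
        for column, metric in COLUMN_TO_METRIC:
            value = row.get(column)
            if value is None:  # e.g. views report NULL for size columns
                continue
            points.append((metric, timestamp, int(value), tags))
    return points


# Made-up sample rows standing in for a real result set.
sample_rows = [
    {"TABLE_SCHEMA": "app", "TABLE_NAME": "users", "DATA_LENGTH": 16384,
     "INDEX_LENGTH": 32768, "TOTAL_LENGTH": 49152, "TABLE_ROWS": 120},
    {"TABLE_SCHEMA": "app", "TABLE_NAME": "v_users", "DATA_LENGTH": None,
     "INDEX_LENGTH": None, "TOTAL_LENGTH": None, "TABLE_ROWS": None},
]
points = table_rows_to_points(sample_rows, "mydb.example.com", 1365465600)
for point in points:
    print(point)
```

Tagging by table means each table adds tag values, so the number of distinct tag values grows with the number of tables being tracked.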
33. And other “common” metrics
• Various MySQL status counters
• QPS (questions)
• Threads connected
• Temporary tables on disk
• Etc.
• Various server statistics
• %CPU Idle
• Free disk space
• I/O utilization
• Network traffic
• Etc.
34. Future collectors
• pt-query-digest/MySQL slow query statistics
• Data from “show engine innodb status”
• (that is missing from counters)
• PERFORMANCE_SCHEMA (MySQL 5.6+)
• Query statistics
• Processlist information
• Background thread information
37. In all seriousness, though...
• Easily see aggregate graphs
• Easily build graphs on-the-fly
• Full granularity forever
• API request for raw data
• Cluster-wide Nagios checks with check_tsd
38. Challenges Switching
• Aggregates are the default
• Mouse-zooming (patched!)
• Auto-suggest for metrics
• “The graphs aren’t pretty”
• Migrating from proof of concept
• Plan for 3+ machines
• Data pruning may be required
39. Some Quick Numbers: OpenTSDB @ Box
• 21,294 metrics
• 72 tag keys
• 5,145,745 tag values
• 90% of interactive graphs return in <300ms
Will be talking about OpenTSDB: how OpenTSDB changed monitoring at Box, and how we leverage its abilities for day-to-day management of MySQL DBs.
You probably have the Percona Cacti graphs and monitoring plugins.
You add some other Nagios checks for fun edge cases.
And you use different tools from the Percona Toolkit, like Stalk, Poor man’s profiler (PMP), and Query Digest.
Suddenly, finding problems and correlating issues is difficult. Maybe you don’t have a NOC yet. Maybe you do, and they need better graphs.
IT’S BIGGER ON THE INSIDE (just kidding). Fast! Easy to build graphs on the fly. Hella easy to scale: just add nodes (HBase or TSDs). Very easy to put data into it; the next slides talk about this.
Running threads follow the CPU spikes PERFECTLY. Box has a “long query” killer that gets more aggressive as more threads stack up. Should get a look at the queries on the server.
Zoom in to get the exact time interval
Know the exact time of a high stack-up. Go check Box Anemometer to see what query is there.
This is the URL for that. Can easily paste this to anyone to see the same interactive graph.
If you prefer text, that’s also an option via the API. You can build cool tools using the API, such as week-over-week graphs, which simplify anomaly detection. The URL is pretty simple: effectively just use “q?” and add “&ascii”.
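A minimal sketch of the “q?” plus “&ascii” idea: build a query URL and parse the text that comes back. The TSD address `tsd.example.com:4242` and the metric name are assumptions, and the sample response is made up; the parser assumes each line has the form `metric timestamp value tag=value ...`, which is how the ASCII output is generally laid out.

```python
from urllib.parse import urlencode


def build_query_url(base_url, metric, start, aggregator="sum",
                    tags=None, end=None):
    """Build a /q URL that asks the TSD for plain-text output."""
    m = f"{aggregator}:{metric}"
    if tags:
        m += "{" + ",".join(f"{k}={v}" for k, v in sorted(tags.items())) + "}"
    params = [("start", start)]
    if end:
        params.append(("end", end))
    params.append(("m", m))
    # Keep ':' '{' '}' ',' '=' literal so the URL stays readable when pasted.
    return f"{base_url}/q?{urlencode(params, safe=':{},=')}&ascii"


def parse_ascii(text):
    """Parse 'metric timestamp value tag=value ...' lines into tuples."""
    points = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        tags = dict(p.split("=", 1) for p in parts[3:])
        points.append((parts[0], int(parts[1]), float(parts[2]), tags))
    return points


url = build_query_url(
    "http://tsd.example.com:4242", "mysql.threads_running", "1h-ago",
    tags={"host": "mydb.example.com"},
)
print(url)

# Made-up response body for illustration.
sample = (
    "mysql.threads_running 1365465600 12 host=mydb.example.com\n"
    "mysql.threads_running 1365465630 15 host=mydb.example.com\n"
)
parsed = parse_ascii(sample)
print(parsed)
```

A week-over-week comparison could then be built by issuing the same query for two windows one week apart and comparing the parsed values.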
Get audit log: logins, types of statements issued, etc.
Get performance information about: row and index change activity, row read activity.
Generate daily reports of: whether auto-increment columns are nearing a boundary on a table, number of records in a table, size of the datafile for a table.
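The auto-increment check can be sketched as a comparison of a table's next auto-increment value against the maximum of its column type. This is an illustration under assumptions, not Box's report: the function names and the 80% threshold are made up, and the inputs are assumed to come from `COLUMN_TYPE` in INFORMATION_SCHEMA.COLUMNS and `AUTO_INCREMENT` in INFORMATION_SCHEMA.TABLES.

```python
# MySQL integer type -> (signed max, unsigned max).
INTEGER_MAX = {
    "tinyint": (127, 255),
    "smallint": (32767, 65535),
    "mediumint": (8388607, 16777215),
    "int": (2147483647, 4294967295),
    "bigint": (2**63 - 1, 2**64 - 1),
}


def auto_increment_usage(column_type, next_value):
    """Fraction of the type's range already consumed.

    column_type looks like 'int(11)' or 'bigint(20) unsigned'.
    """
    parts = column_type.lower().split()
    base = parts[0].split("(")[0]
    signed_max, unsigned_max = INTEGER_MAX[base]
    limit = unsigned_max if "unsigned" in parts else signed_max
    return next_value / limit


def near_boundary(column_type, next_value, threshold=0.8):
    """True when usage is at or above the (hypothetical) threshold."""
    return auto_increment_usage(column_type, next_value) >= threshold


print(near_boundary("int(11)", 1_900_000_000))           # signed INT
print(near_boundary("int(10) unsigned", 1_900_000_000))  # unsigned INT
```

The usage fraction itself could also be pushed as a metric, so the trend is visible on a graph rather than only in a daily report.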
Using pt-tcp-model. Allows us to identify when a server stops doing work. 5min interval.
Aggregate graphs are the default. Drill down only when there are problems in the aggregate.
• Aggregates are the default: a shift in thinking from looking at specific important servers.
• Mouse-zooming: zooming in on a time slice was painfully manual, so I wrote up a patch to add mouse-zooming and upstreamed it. This cemented OpenTSDB as a powerful monitoring tool for Box, overnight.
• Auto-suggest for metrics is spotty: we wrote a quick cron job that dumps the full metric list into JSON.
• “The graphs aren’t pretty”: a few changes to the base Gnuplot options solved this. There’s also a “Smooth” option in the interface now.
• Migrating from POC: we had a single-node setup for the longest time until that fell over... a lot.
• Plan for 3+ machines: it’s enough to run all the needed bits for a lightweight distributed HBase and TSD setup.
• Data pruning: ~4 bytes per metric before HDFS replication adds up quickly. mysql_tcollector collects 370 metrics, so ~1.5k per server per collection; at a 30s interval that is ~4.2MB/day per server. Either have a plan to prune old data, or build out extra capacity and predict storage needs per server/metric added.
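The storage arithmetic in the data pruning note can be reproduced with a short calculation. This is a back-of-the-envelope sketch using the figures from the note (370 metrics, ~4 bytes per data point, 30s interval); the function name and the replication parameter are assumptions, with 3 shown only because it is the usual HDFS default.

```python
SECONDS_PER_DAY = 86_400


def daily_bytes_per_server(metrics, interval_seconds,
                           bytes_per_point=4, replication=1):
    """Estimate bytes stored per server per day."""
    samples_per_day = SECONDS_PER_DAY // interval_seconds
    return metrics * bytes_per_point * samples_per_day * replication


per_collection = 370 * 4                       # bytes per collection run
per_day = daily_bytes_per_server(370, 30)      # before HDFS replication
per_day_replicated = daily_bytes_per_server(370, 30, replication=3)

print(f"{per_collection} bytes per collection")
print(f"{per_day / 1e6:.2f} MB/day before replication")
print(f"{per_day_replicated / 1e6:.2f} MB/day with 3x replication")
```

Multiplying the per-server figure by the number of servers and the retention period gives a first estimate of the capacity to plan for.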