2. SPEAKER
WHO IS THIS GUY?
▸ Sr. Database Architect at Medallia
▸ Recent fun employments:
▸ Principal Database Engineer @ WithMe
▸ Lead Database Architect @ OmniTI
▸ Expertise in PostgreSQL, Oracle, MySQL, NoSQL
▸ Contact: denish.j.patel@gmail.com or dpatel@medallia.com
▸ Twitter: @DenishPatel
▸ Blog: http://www.pateldenish.com
▸ Postgres Slack Channel (https://postgres-slack.herokuapp.com/)
3. AGENDA
DISCUSSION LIST
▸ What to look for in a monitoring solution in general?
▸ Comparison - selected open source and commercial monitoring solutions
▸ Which metrics to collect and how?
▸ Which metrics to alert on and how to define thresholds?
▸ How to keep up with monitoring changes?
▸ How to react to alerts at 3 AM?
▸ Open discussion
4. SELECTION CRITERIA
WHAT TO LOOK FOR IN A MONITORING SOLUTION?
▸ Blend of system monitoring with Postgres support
▸ Centralized monitoring
▸ Hosted vs On-premise
▸ Security concerns of clients
▸ Alerting and Dashboard/Graphs
▸ Easy installation and configuration
▸ Postgres Support
▸ pg_stat_statements
▸ Resource monitoring - CPU, RAM, DISK IO & Network
▸ pgbouncer support
5. COMPARISON
MONITORING SOLUTIONS
▸ Open Source
▸ Sensu
▸ Zabbix
▸ Zenoss (Limited capabilities)
▸ Nagios
▸ (Stop using Nagios so it can die peacefully!!)
▸ SAAS Offerings
▸ Wavefront
▸ Circonus
▸ Vividcortex
▸ OkMeter
▸ NewRelic
7. COMPARISON
SAAS OFFERINGS
Solution              Postgres support   Configuration                     Confidence
Wavefront / Collectd  Yes                collectd plugins                  HIGH
Circonus              Yes                Default checks                    HIGH
Vividcortex           Yes                Default checks                    HIGH
Okmeter               Yes                One click install, pgbouncer      HIGH
New Relic             Yes                Plugins - missing some metrics    MEDIUM
9. WAVEFRONT.COM
WAVEFRONT
▸ Nice Dashboard and alerting functionality
▸ Very scalable solution
▸ Works with existing metrics collection tools, e.g. collectd
▸ Real time analytics capability
▸ Complete monitoring suite
10. OKMETER.IO
OKMETER
▸ Agent-based system: install the agent in your environment to monitor application, database, or any other servers
▸ Very easy to install and configure
▸ Provides easy-to-configure Postgres server monitoring using pg_stat_statements together with server stats. Once the agent is installed, you get everything without further effort
▸ Built-in pgbouncer monitoring
▸ Built-in monitoring of all resources: disk, CPU, network & memory
12. USE CASE
MONITORING SOLUTION
▸ 150+ DB clusters across the globe
▸ Easy installation
▸ Standardization
▸ Centralized solution
▸ Real time analytics
▸ Support new Infra - Docker/Aurora/Mesos
13. METRICS COLLECTION
SETUP ROLE
create role collectd login encrypted password 'XXX';
create schema collectd;
set search_path = collectd,pg_catalog;
grant usage on schema collectd to collectd;
alter role collectd set search_path = collectd,pg_catalog;
▸ Things to consider:
▸ Separate role for monitoring
▸ No superuser role
▸ Limited permissions
16. METRICS COLLECTION
PG_STAT_ACTIVITY
-- created in the collectd schema (first in search_path); runs with the owner's privileges
create or replace function pg_stat_activity()
returns setof pg_catalog.pg_stat_activity
as $$
begin
return query select * from pg_catalog.pg_stat_activity;
end $$
language plpgsql security definer;
revoke all on function pg_stat_activity() from public;
grant execute on function pg_stat_activity() to collectd;
17. METRICS COLLECTION
TRANSACTIONS
<Query transactions>
Statement "SELECT xact_commit, xact_rollback
FROM pg_stat_database
WHERE datname = $1;"
Param database
<Result>
Type "pg_xact"
InstancePrefix "commit"
ValuesFrom "xact_commit"
</Result>
<Result>
Type "pg_xact"
InstancePrefix "rollback"
ValuesFrom "xact_rollback"
</Result>
</Query>
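xact_commit and xact_rollback are cumulative counters, so the number worth graphing is the change between two samples. A minimal Python sketch of that conversion follows. The helper name per_second_rate and the sample values are hypothetical and not part of collectd:

```python
def per_second_rate(prev_value, curr_value, interval_seconds):
    """Convert two samples of a cumulative counter into a per-second rate.

    Returns None when the counter went backwards (for example after a
    statistics reset), because no meaningful rate can be derived.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_value - prev_value
    if delta < 0:
        return None
    return delta / interval_seconds


# Hypothetical xact_commit samples taken 10 seconds apart.
commits_per_second = per_second_rate(1_000_000, 1_000_450, 10)
print(commits_per_second)
```

Returning None on a backwards step is one possible design choice; a collector could equally drop the sample or restart from the new baseline.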
18. METRICS COLLECTION
QUERIES (DML)
<Query queries>
Statement "SELECT sum(n_tup_ins) AS ins,
sum(n_tup_upd) AS upd,
sum(n_tup_del) AS del,
sum(n_tup_hot_upd) AS hot_upd
FROM pg_stat_user_tables;"
<Result>
Type "pg_n_tup_c"
InstancePrefix "ins"
ValuesFrom "ins"
</Result>
..
.
.
.
</Query>
19. METRICS COLLECTION
TABLE_STATES
<Query table_states>
Statement "SELECT sum(n_live_tup) AS live, sum(n_dead_tup) AS dead
FROM pg_stat_user_tables;"
<Result>
Type "pg_n_tup_g"
InstancePrefix "live"
ValuesFrom "live"
</Result>
<Result>
Type "pg_n_tup_g"
InstancePrefix "dead"
ValuesFrom "dead"
</Result>
</Query>
20. METRICS COLLECTION
QUERY_PLANS
<Query query_plans>
Statement "SELECT sum(seq_scan) AS seq,
sum(seq_tup_read) AS seq_tup_read,
sum(idx_scan) AS idx,
sum(idx_tup_fetch) AS idx_tup_fetch
FROM pg_stat_user_tables;"
<Result>
Type "pg_scan"
InstancePrefix "seq"
ValuesFrom "seq"
.
.
</Query>
21. METRICS COLLECTION
DISK_IO
<Query disk_io>
Statement "SELECT coalesce(sum(heap_blks_read), 0) AS heap_read,
coalesce(sum(heap_blks_hit), 0) AS heap_hit,
coalesce(sum(idx_blks_read), 0) AS idx_read,
coalesce(sum(idx_blks_hit), 0) AS idx_hit,
coalesce(sum(toast_blks_read), 0) AS toast_read,
coalesce(sum(toast_blks_hit), 0) AS toast_hit,
coalesce(sum(tidx_blks_read), 0) AS tidx_read,
coalesce(sum(tidx_blks_hit), 0) AS tidx_hit
FROM pg_statio_user_tables;"
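The read and hit counters above are often combined into a buffer cache hit ratio. A small Python sketch of that arithmetic follows, assuming the values are deltas over the same sampling interval. The function name hit_ratio and the sample numbers are hypothetical:

```python
def hit_ratio(blocks_read, blocks_hit):
    """Fraction of block requests that were found in shared buffers.

    blocks_read counts blocks that had to be requested from outside
    shared buffers; blocks_hit counts blocks already in shared buffers.
    """
    total = blocks_read + blocks_hit
    if total == 0:
        return None  # no activity in the interval, ratio is undefined
    return blocks_hit / total


# Hypothetical heap_read / heap_hit deltas for one interval.
print(hit_ratio(500, 99_500))
```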
22. METRICS COLLECTION
DISK USAGE / DB SIZE
<Query disk_usage>
Statement "SELECT pg_database_size($1) AS size;"
Param database
<Result>
Type pg_db_size
ValuesFrom "size"
</Result>
</Query>
23. METRICS COLLECTION
CONNECTIONS #CUSTOM
<Query connections>
Statement "SELECT COUNT(state) AS count, state FROM (SELECT CASE
WHEN state = 'idle' THEN 'idle'
WHEN state = 'idle in transaction' THEN 'idle_in_transaction'
WHEN state = 'active' THEN 'active'
ELSE 'unknown' END AS state
FROM collectd.pg_stat_activity) state
GROUP BY state
UNION
SELECT COUNT(*) AS count, 'waiting' AS state
FROM collectd.pg_stat_activity WHERE waiting ;"
<Result>
Type "pg_numbackends"
InstancePrefix "state"
InstancesFrom "state"
ValuesFrom "count"
</Result>
</Query>
27. METRICS COLLECTION
LOCKS
<Query locks>
Statement "SELECT COUNT(mode) AS count, mode FROM pg_locks GROUP BY mode
UNION SELECT COUNT(*) AS count, 'waiting' AS mode FROM pg_locks
WHERE granted is false ;"
<Result>
Type "gauge"
InstancePrefix "pg_locks"
InstancesFrom "mode"
ValuesFrom "count"
</Result>
</Query>
29. METRICS COLLECTION
WAL_FILES
<Query wal_files>
Statement "SELECT archived_count AS count, failed_count AS failed FROM pg_stat_archiver;"
<Result>
Type "gauge"
InstancePrefix "pg_wal_count"
ValuesFrom "count"
</Result>
<Result>
Type "gauge"
InstancePrefix "pg_wal_failed"
ValuesFrom "failed"
</Result>
</Query>
30. METRICS COLLECTION
SCANS
<Query scans>
Statement "SELECT sum(idx_scan) as index_scans, sum(seq_scan) as seq_scans,
sum(idx_tup_fetch) as index_tup_fetch, sum(seq_tup_read) as seq_tup_read
FROM pg_stat_all_tables ; "
<Result>
Type "pg_scan"
InstancePrefix "index"
ValuesFrom "index_scans"
</Result>
.
.
</Query>
31. METRICS COLLECTION
SEQ_SCANS
<Query seq_scans>
Statement "SELECT CASE WHEN status='OK' THEN 0 ELSE 1 END AS status
FROM ( SELECT get_seq_scan_on_large_tables AS status
FROM collectd.get_seq_scan_on_large_tables()) AS foo;"
<Result>
Type "gauge"
InstancePrefix "pg_seq_scans"
ValuesFrom "status"
</Result>
</Query>
32. METRICS COLLECTION
SEQ_SCAN_ON_LARGE_TABLES
CREATE MATERIALIZED VIEW collectd.seq_scan_on_large_tables AS
SELECT relid, schemaname, relname, seq_scan, seq_tup_read ,
pg_relation_size(relid) as relsize, now() as refreshed_at
FROM pg_stat_all_tables
WHERE pg_relation_size(relid) > 1073741824
AND schemaname not in ('pg_catalog', 'information_schema')
UNION ALL SELECT 0,'0','0','0',0,0,now();
ALTER MATERIALIZED VIEW collectd.seq_scan_on_large_tables OWNER TO collectd;
33. METRICS COLLECTION
GET_SEQ_SCAN_ON_LARGE_TABLES
CREATE OR REPLACE FUNCTION collectd.get_seq_scan_on_large_tables()
RETURNS text AS
$$
DECLARE
  v_refreshed_at timestamptz;
  v_tables_with_seq_scan text[];
BEGIN
  -- the sentinel row (relid = 0) records when the MV was last refreshed
  SELECT refreshed_at INTO v_refreshed_at
  FROM collectd.seq_scan_on_large_tables WHERE relid = 0;
  -- refresh MV every 4 hours (skipped on a standby)
  IF v_refreshed_at < now() - interval '4 hours' AND pg_is_in_recovery() IS FALSE THEN
    REFRESH MATERIALIZED VIEW collectd.seq_scan_on_large_tables;
  END IF;
  -- tables with new seq scans since the snapshot that read > 50000 rows per scan on average
  SELECT ARRAY(
    SELECT base.relname || ':' || (current.seq_scan - base.seq_scan)
    FROM collectd.seq_scan_on_large_tables AS base
    LEFT JOIN pg_stat_all_tables AS current
      ON (base.schemaname = current.schemaname AND base.relname = current.relname)
    WHERE (current.seq_scan - base.seq_scan) > 0
      AND ((current.seq_tup_read - base.seq_tup_read) / (current.seq_scan - base.seq_scan)) > 50000
  ) INTO v_tables_with_seq_scan;
  IF v_tables_with_seq_scan = '{}' THEN
    RETURN 'OK';
  ELSE
    RETURN 'PROBLEM: Seq scan on table: ' || array_to_string(v_tables_with_seq_scan, '&');
  END IF;
END;
$$
LANGUAGE plpgsql SECURITY DEFINER;
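The check in the function above compares a stored baseline with the live counters and flags tables whose new sequential scans read many rows on average. The same comparison can be sketched in Python to make the arithmetic easier to follow. The function name, the dictionary layout, and the sample numbers are hypothetical; only the 50000 cutoff comes from the slide:

```python
def tables_with_heavy_seq_scans(baseline, current, min_avg_tuples=50_000):
    """Return 'table:new_scans' strings for tables whose new sequential
    scans read more than min_avg_tuples rows per scan on average.

    baseline and current map a table name to a (seq_scan, seq_tup_read) pair.
    """
    flagged = []
    for table, (base_scans, base_tuples) in baseline.items():
        if table not in current:
            continue
        cur_scans, cur_tuples = current[table]
        new_scans = cur_scans - base_scans
        if new_scans <= 0:
            continue  # no new sequential scans since the baseline
        # Integer division, mirroring bigint division in the SQL version.
        if (cur_tuples - base_tuples) // new_scans > min_avg_tuples:
            flagged.append(f"{table}:{new_scans}")
    return flagged


# Hypothetical counters: "orders" gained 2 scans reading 400000 rows in total.
baseline = {"orders": (10, 1_000_000), "events": (5, 100)}
current = {"orders": (12, 1_400_000), "events": (9, 500)}
print(tables_with_heavy_seq_scans(baseline, current))
```

Here only "orders" is flagged: its two new scans averaged 200000 rows each, while the four new scans on "events" averaged 100.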
34. METRICS COLLECTION
AVG_QUERYTIME
<Query avg_querytime>
Statement "SELECT sum(total_time)/sum(calls) AS avg_querytime FROM
collectd.get_stat_statements() ;"
<Result>
Type "gauge"
InstancePrefix "pg_avg_querytime"
ValuesFrom "avg_querytime"
</Result>
</Query>
35. METRICS COLLECTION
GET_STAT_STATEMENTS
create extension IF NOT EXISTS pg_stat_statements WITH SCHEMA collectd;
alter schema collectd owner to collectd;
CREATE OR REPLACE FUNCTION collectd.get_stat_statements() RETURNS SETOF
pg_stat_statements AS
$$
SELECT * FROM pg_stat_statements
WHERE dbid IN (SELECT oid FROM pg_database WHERE datname = current_database());
$$ LANGUAGE sql VOLATILE SECURITY DEFINER;
38. METRICS COLLECTION
CHECKPOINTS
<Query checkpoints>
Statement "SELECT (checkpoints_timed + checkpoints_req) AS total_checkpoints
FROM pg_stat_bgwriter ;"
<Result>
Type "counter"
InstancePrefix "pg_checkpoints"
ValuesFrom "total_checkpoints"
</Result>
</Query>
39. METRICS COLLECTION
SLAVE LAG
<Query slave_lag>
Statement "SELECT CASE WHEN pg_is_in_recovery = 'false' THEN 0
ELSE COALESCE(ROUND(EXTRACT(epoch FROM now() - pg_last_xact_replay_timestamp())),0) END
AS seconds
FROM pg_is_in_recovery();"
<Result>
Type "gauge"
InstancePrefix "slave_lag"
ValuesFrom "seconds"
</Result>
</Query>
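The lag query reports 0 on a primary, 0 when no transaction has been replayed yet, and otherwise the age of the last replayed transaction in seconds. A minimal Python sketch of the same decision logic follows, with a hypothetical function name and timestamps:

```python
from datetime import datetime, timedelta, timezone


def slave_lag_seconds(in_recovery, now, last_replay):
    """Mirror the CASE / COALESCE logic of the lag query.

    in_recovery: True on a standby, False on a primary.
    last_replay: timestamp of the last replayed transaction, or None.
    """
    if not in_recovery or last_replay is None:
        return 0
    # Note: Python's round() rounds halves to even, unlike SQL ROUND().
    return round((now - last_replay).total_seconds())


# Hypothetical standby whose last replayed transaction is 42 seconds old.
now = datetime(2016, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(slave_lag_seconds(True, now, now - timedelta(seconds=42)))
```

One caveat worth keeping in mind: on an idle primary the last replay timestamp stops advancing, so this number keeps growing even when the standby is fully caught up.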
40. ALERTING
SETUP ALERTS ON DB METRICS
▸ Uptime
▸ Waiting Connections
▸ # of connections waiting > 5
▸ Slow queries
▸ # of slow queries > 5
▸ Seq scan on large tables
▸ TXN Wraparound
▸ Age Over 1.5B
▸ Disk space usage
▸ 85%?
▸ Slave lag
▸ 5 minutes?
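The thresholds listed above can be kept in one table and checked in a single pass. This is only a sketch of the idea in Python, not any particular monitoring tool's API. The metric names are hypothetical, and the 85% and 5 minute values carry question marks on the slide, so treat them as starting points:

```python
# Thresholds taken from the alert list; metric names are hypothetical.
THRESHOLDS = {
    "waiting_connections": 5,
    "slow_queries": 5,
    "txn_age": 1_500_000_000,
    "disk_used_percent": 85,
    "slave_lag_seconds": 5 * 60,
}


def breached(metrics, thresholds=THRESHOLDS):
    """Return the sorted names of metrics whose value exceeds its threshold.

    Metrics without a configured threshold, and thresholds without a
    reported metric, are ignored.
    """
    return sorted(
        name
        for name, limit in thresholds.items()
        if name in metrics and metrics[name] > limit
    )


# Hypothetical readings from one database cluster.
sample = {
    "waiting_connections": 7,
    "slow_queries": 2,
    "txn_age": 900_000_000,
    "disk_used_percent": 91,
    "slave_lag_seconds": 300,
}
print(breached(sample))
```

With these readings only disk usage and waiting connections are reported. A lag of exactly 300 seconds does not fire because the comparison is strictly greater than.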
41. MONITORING CHANGES
HOW TO KEEP UP?
▸ Design with failover in mind
▸ Keep an eye on new monitoring features in the latest DB or OS versions
▸ Postgres 9.5 enhancements
▸ Commit timestamp tracking
▸ SELECT * FROM pg_last_committed_xact();
▸ cluster_name
▸ $ ps -ef | grep checkpointer
▸ postgres 12181 12178 0 11:12 ? 00:00:00 postgres: personnel: checkpointer process
▸ postgres 12207 12204 0 11:12 ? 00:00:00 postgres: reportsdb: checkpointer process
▸ postgres 12233 12230 0 11:12 ? 00:00:00 postgres: management: checkpointer process
▸ A bunch of changes coming in Postgres 9.6
▸ Improved pg_stat_activity view provides more detail about which resource a session is waiting on
▸ Deploy monitoring through config management tools
42. INCIDENT MANAGEMENT
HOW TO BE READY TO HANDLE 3AM CALL?
▸ PagerDuty calendar : https://www.pagerduty.com/
▸ Document metrics
▸ URL for the Dashboard
▸ Alert resolution procedure
▸ Clear SLAs (Decision)
▸ Escalation policy
▸ Scenarios
▸ Wait for the server to come back up
▸ Failover
▸ Review alerts before going on call
▸ On-call notification
▸ Plan for the worst and document accordingly
▸ What if you are at a movie theatre, on the beach, etc.?
▸ What if you can’t jump on the server?
▸ Keep the document up to date