In this session, Tim will cover principles, learnings, and practical advice from operating multiple cloud services at scale, including, of course, our InfluxDB Cloud service. What do we monitor, what do we alert on, and how did we architect it all? What are our underlying architectural and operational principles?
Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | Tim Hall | InfluxData
1. Tim E. Hall @thallinflux
VP, Products InfluxData
Lessons Learned: Running
InfluxDB Cloud at Scale
2. Discussion Topics
Brief History of InfluxDB Cloud
Gathering Metrics...and Logs
Visualization, Monitoring, and Alerting
Troubleshooting Scenarios
What did we miss? So many things…
3. A Brief History of InfluxDB Cloud 1.0…
May 2014
• Open Source DBaaS
• Hosted on Digital Ocean
April 2016
• Enterprise Edition DBaaS
• Kapacitor Add-On
• Hosted on AWS
August 2017
• Enterprise Edition DBaaS
• Chronograf and limited Kapacitor included
• Co-monitoring
• Pay-as-you-go storage
4. From development
to production
• Establish monitoring baselines
• Ensure visibility into the health of the system
• Notifications for the most common issues, before
they become outages
5. From OSS to Enterprise
[Diagram: a single InfluxDB OSS node alongside an InfluxDB Enterprise cluster with three meta nodes (Meta 1, Meta 2, Meta 3) and two data nodes (Data Node 1, Data Node 2)]
7. InfluxDB Cloud 1: Deployment Diagram
[Diagram: InfluxDB Cloud 1 cluster deployment]
• Node types: a meta node quorum (three meta nodes), data nodes, an optional Kapacitor node add-on, and a Kach node running Chronograf and Kapacitor (Chronograf access only).
• Running procs on every node: Docker, ssh, etcd.
• Docker containers on each node: InfluxEnterprise Meta or InfluxEnterprise Data, plus Automatron, LogSpout, Telegraf, and SkyDNS.
• External services: Papertrail (log archival), InfluxData Monitoring, and InfluxData Provisioning.
• Access: browser-based access and CLI and/or programmatic access enter through an ALB (shared across n clusters) via :443 TLS listeners, routed to :8086 (Data Node), :9092 (Kapacitor Node), and :8088 (Chronograf).
• Shared Security Group (open ports between nodes): :3000, :4001, :7001, :8083, :8086, :8088, :8089, :8091, :9092.
• Other Port Access: :46939 – Provisioning System; :22 – open to bastion host only (for ssh).
8. Description of common processes and services
within InfluxCloud
Running processes
– Each node has the following processes running
• Docker -- container infrastructure within which ALL InfluxEnterprise components execute
• ssh – secure shell to allow for secure, remote login
• etcd – provides a common rendezvous point for InfluxDB Enterprise components in the event of
changes in the underlying infrastructure
– Docker containers common across nodes
• LogSpout gathers InfluxEnterprise-related log output and delivers it to Papertrail for storage,
archival, and search.
• Telegraf gathers metrics and events from the system services and InfluxEnterprise
components to facilitate remote monitoring.
• Automatron is a custom-built provisioning infrastructure that allows delivery of software
updates to any of the containers deployed across the nodes.
9. Deploy Telegraf on all nodes (meta and data)
By enabling these plugins, KPIs routinely associated with infrastructure and database performance can
be measured and serve as a good starting point for monitoring.
Minimum Recommendation:
1. CPU: collects standard CPU metrics
2. System: gathers general stats on system load, uptime, and number of users logged in
3. Processes: counts the number of processes, grouped by status
4. DiskIO: gathers metrics about disk traffic and timing
5. Disk: gathers metrics about disk usage
6. Mem: collects system memory metrics
7. NetStat: gathers network-related metrics
8. http_response: sets up a local ping check
9. filestat: gathers stats about specific files (meta nodes only)
10. InfluxDB: gathers stats from the InfluxDB instance (data nodes only)
Optional:
1. Logs: requires syslog
2. Swap: collects system swap metrics
3. Internal: gathers Telegraf-related stats
4. Docker: if deployed in containers
10. Telegraf Configuration: Global
[global_tags]
cluster_id = "$CLUSTER_ID"
environment = "$ENVIRONMENT"
[agent]
interval = "10s"
round_interval = true
metric_buffer_limit = 10000
metric_batch_size = 1000
collection_jitter = "0s"
flush_interval = "30s"
flush_jitter = "30s"
debug = false
hostname = ""
All plugins are controlled by the telegraf.conf file. Administrators can easily enable or disable plugins and
options by editing this file.
Global tags can be specified in the [global_tags]
section of the config file in key="value" format. Use
a GUID which uniquely identifies each “cluster” and
ensure that the environment variable exists consistently
on all hosts (meta and data). Optionally, add other
tags if desired. Example: dev, prod for environment.
Agent Configuration shows the recommended settings
for InfluxDB data collection. Adjust interval and
flush_interval based on:
● the desired “speed of observability”
● the retention policy for the data
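For illustration, a minimal filled-in sketch of the same section; the GUID and environment values here are hypothetical:
[global_tags]
# GUID uniquely identifying this cluster (hypothetical value)
cluster_id = "6f1c9a52-8e14-4c7e-9a3b-2d0c5e7f41aa"
# must be set consistently on every meta and data node
environment = "prod"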
11. Telegraf Configuration: Inputs (common)
# INPUTS
[[inputs.cpu]]
percpu = false
totalcpu = true
fieldpass = ["usage_idle",
"usage_user", "usage_system",
"usage_steal"]
[[inputs.mem]]
[[inputs.netstat]]
[[inputs.system]]
[[inputs.diskio]]
Input Configuration items grab metrics
from the various infrastructure, database, and
system components in play.
For the other plug-ins, the default config is sufficient.
12. Telegraf Configuration: Inputs Data Nodes
# INPUTS
[[inputs.influxdb]]
interval = "15s"
urls = ["http://<localhost>:8086/debug/vars"]
timeout = "15s"
[[inputs.http_response]] #DATA
address = "http://<localhost>:8086/ping"
[[inputs.disk]]
mount_points =
["/var/lib/influxdb/data","/var/lib/influxdb/wal",
"/var/lib/influxdb/hh","/"]
InfluxDB grabs all metrics from the
exposed endpoint.
http_response allows you to ping
individual data nodes and track
response output.
You can also set up a separate Telegraf
agent elsewhere within your
infrastructure to ping the available
cluster(s) through the load balancer.
disk allows you to configure the
various volumes/mount points on
disk -- locations of data, wal, hinted
handoff -- and root. (default config
options shown)
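As a sketch of that external check, a standalone Telegraf agent can run the same http_response input pointed at the load balancer; the URL below is a placeholder assumption:
# INPUT (standalone agent, outside the cluster)
[[inputs.http_response]] #CLUSTER
# hypothetical load-balancer endpoint for the cluster
address = "https://cluster-lb.example.com/ping"
response_timeout = "5s"
method = "GET"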
13. Telegraf Configuration: Inputs Meta Nodes
# INPUTS
[[inputs.http_response]] #META
address = "http://<localhost>:8091/ping"
[[inputs.filestat]]
files =
["/var/lib/influxdb/meta/snapshots/*/state.bin"]
md5 = false
[[inputs.disk]]
mount_points = ["/var/lib/influxdb/meta", "/"]
http_response allows you to ping
individual meta nodes and track response
output.
filestat allows you to monitor metadata
snapshots.
disk allows you to configure the
various volumes/mount points on
disk -- locations of meta store -- and
root. (default config options shown)
14. Telegraf Configuration: Outputs
# OUTPUTS
[[outputs.influxdb]]
urls = [ "<target URL of DB>" ]
database = "telegraf"
retention_policy = "autogen"
timeout = "10s"
username = "<uname>"
password = "<pword>"
content_encoding = "gzip"
Output Configuration tells Telegraf which
output sink to send the data to. Multiple
output sinks can be specified in the
configuration file.
** NOTE: This should point to the load
balancer, if you are storing the metrics into
a cluster.
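For example, a minimal sketch with two output sinks, one local and one pointed at a remote monitoring cluster; the remote URL and credentials are placeholders:
# OUTPUTS
[[outputs.influxdb]]
# local instance
urls = [ "http://localhost:8086" ]
database = "telegraf"
[[outputs.influxdb]]
# hypothetical remote monitoring cluster (point at its load balancer)
urls = [ "https://monitoring-lb.example.com" ]
database = "telegraf"
username = "<uname>"
password = "<pword>"
content_encoding = "gzip"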
15. Telegraf Configuration: Gathering Logs
# INPUT
[[inputs.syslog]]
# OUTPUTS
[[outputs.influxdb]]
urls = [ "http://localhost:8086" ]
database = "telegraf"
# Drop all measurements that start with "syslog"
namedrop = [ "syslog*" ]
[[outputs.influxdb]]
urls = [ "http://localhost:8086" ]
database = "telegraf"
retention_policy = "14days"
# Only accept syslog data:
namepass = [ "syslog*" ]
Output Configuration: use
namepass/namedrop to
direct metrics/logs to
different db.rp targets.
** NOTE: This should point
to the load balancer, if you
are storing the metrics in
a cluster.
Input Configuration: add
the syslog input plug-in.
Review the settings for
your environment.
InfluxDB can be used to capture both metrics and events. The syslog protocol is used to gather the logs.
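As a sketch, the syslog input listens for messages forwarded by rsyslog or syslog-ng; the listener address below is an assumption to adapt to your environment:
# INPUT
[[inputs.syslog]]
# listen for forwarded syslog messages
# (6514 is the conventional port for syslog over TLS/TCP)
server = "tcp://:6514"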
19. Disk Usage Batch Task: TICKscript
// Monitor disk usage for all hosts
var data = batch
|query('''
SELECT last(used_percent) AS used_percent
FROM "telegraf"."autogen"."disk"
WHERE ("host" =~ /prod-.*/)
AND ("path" = '/var/lib/influxdb/data'
OR "path" = '/var/lib/influxdb/wal'
OR "path" = '/var/lib/influxdb/hh'
OR "path" = '/')
''')
.period(5m)
.every(10m)
.groupBy('host', 'role', 'environment', 'device')
20. Disk Usage Alert: TICKscript
var warn_threshold = 85
var critical_threshold = 95
data
|alert()
.id('Host: {{ index .Tags "host" }}, Environment: {{ index .Tags
"environment" }}')
.message('Alert: Disk Usage, Level: {{ .Level }}, Device: {{ index
.Tags "device" }}, {{ .ID }}, Usage: {{ index .Fields "used_percent" }}%')
.warn(lambda: "used_percent" > warn_threshold)
.crit(lambda: "used_percent" > critical_threshold)
.slack()
.channel('#monitoring')
21. Hinted Handoff Queue Batch Task: TICKscript
// This generates alerts for high hinted-handoff queues for InfluxEnterprise
var queue_size = batch
|query('''
SELECT max(queueBytes) as "max"
FROM "telegraf"."autogen"."influxdb_hh_processor"
WHERE ("host" =~ /prod-.*/)
''')
.groupBy('host', 'cluster_id')
.period(5m)
.every(10m)
|eval(lambda: "max" / 1048576.0)
.as('queue_size_mb')
22. Hinted Handoff Queue Alert: TICKscript
var warn_threshold = 3500
var crit_threshold = 5000
queue_size
|alert()
.id('InfluxEnterprise/{{ .TaskName }}/{{ index .Tags "cluster_id"
}}/{{ index .Tags "host" }}')
.message('Host {{ index .Tags "host" }} (cluster {{ index .Tags
"cluster_id" }}) has a hinted-handoff queue size of {{ index .Fields
"queue_size_mb" }}MB')
.details('')
.warn(lambda: "queue_size_mb" > warn_threshold)
.crit(lambda: "queue_size_mb" > crit_threshold)
.stateChangesOnly()
.slack()
.pagerDuty()
26. Common Troubleshooting Scenarios
Workload Type
• Which type are we looking at?
– Read heavy
– Write heavy
– Mixed?
• Establish baselines and understand “normal” using metrics and visualization
• Baselines allow us to understand change over time and help determine when it is time to scale up
Log Analysis
• Metrics first!
– Highlights where you should look within the log files
• Logs allow for pinpointing the root cause of an issue observed via metrics
– Cache max memory size
– Hinted Handoff Queue “Blocked”
IOPS & Disk Throughput
• Understand the capabilities of the hardware by plan size
– Develop and review sizing guidelines
– Understand max read and write limits based on machine class and drive types – these can change as you scale! (see the sketch below)
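As a sketch of watching throughput against those limits, a Kapacitor stream task can derive MB/s from the Telegraf diskio measurement and alert as usage approaches an assumed plan cap; the limit value and Slack channel are hypothetical:
// Hypothetical: alert when disk read throughput nears the plan's cap
var plan_limit_mb_per_s = 250.0
stream
    |from()
        .measurement('diskio')
        .groupBy('host', 'name')
    // convert the cumulative read_bytes counter into a per-second rate
    |derivative('read_bytes')
        .unit(1s)
        .nonNegative()
        .as('read_bytes_per_s')
    |eval(lambda: "read_bytes_per_s" / 1048576.0)
        .as('read_mb_per_s')
    |alert()
        .id('Disk Throughput: {{ index .Tags "host" }}/{{ index .Tags "name" }}')
        .message('{{ .ID }} reading at {{ index .Fields "read_mb_per_s" }} MB/s')
        .warn(lambda: "read_mb_per_s" > plan_limit_mb_per_s * 0.8)
        .crit(lambda: "read_mb_per_s" > plan_limit_mb_per_s)
        .stateChangesOnly()
        .slack()
        .channel('#monitoring')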
27. What did we miss? So many things…
Head for the balcony!
– Shift from instance-based dashboards to “fleet management”
What’s the experience of the “customer”?
– Real user monitoring from the front door
– Integration with subscription management system
SSL Cert expiration
E-commerce system monitoring
– Health and availability of supporting components
28. Recap
Gather Metrics...and Logs (for context)
Visualize, Monitor, and Alert… tune based on your environment
Iterate and address “new” scenarios to eliminate alert fatigue
https://community.influxdata.com https://docs.influxdata.com