Nobl9 is a Service Level Objective Platform for measuring and monitoring reliability. We will look under the hood of an SLO platform using InfluxDB as part of the core architecture. We’ll talk about the project, the decisions we took, the challenges we faced, the mistakes we made, and the lessons learned.
Alex Nauda [Nobl9] | How Not to Build an SLO Platform | InfluxDays NA 2021
1. How Not to Build an SLO Platform
Alex Nauda
CTO, Nobl9
Twitter @alexnauda
Email alex@nobl9.com
AGENDA
1. What is Nobl9 and why does it use InfluxDB?
2. Features of Nobl9 supported by InfluxDB
3. Lessons learned and challenges going forward
3. Nobl9 Architecture - Black Box View
[Diagram labels. Inputs: Customer Platforms and Services (App, CI/CD, Web Services, Data); New Relic, Prometheus, Datadog; Raw SLIs; SLO Config via YAML (GitOps) or GUI. Nobl9: Web, App, API, Calculations, InfluxDB. Outputs: Error Budgets, SLO Based Events & Alerts, Graphs, Reports. Audiences: Ops/SREs & Application Leads (Alert, Prioritize, Review & Align); PM & Business Stakeholders (Govern, Align, Review & Align).]
4. Why did we consider InfluxDB in the first place?
● Query-friendly Time Series Database: Flexible query capabilities to drive all our SLO charts and graphs
● Deployment options: Cloud offering for our SaaS platform; Enterprise for self-hosting customers; OSS for dev env… with good query compatibility across them all
● Commercial support: Firm requirement both for us (managing our SaaS offering) and our customers (self-hosting and managing Nobl9 including InfluxDB)
5. InfluxDB is useful across many of the core features of Nobl9
Nobl9 Feature Set
● Data Intake: Receiving telemetry data from a variety of sources
● Calculation of SLO time series: Processing of telemetry data (SLIs) and math
● Graphs and Reports: A variety of visualizations, real-time and historical
● Alerting on SLO time series: Notify various integrations based on configuration
● Data Export: Real-time and batch exports to other tools
6. Data Intake
Telemetry Requirements
Receiving telemetry data from a variety of sources
● Support a wide variety of data sources -- 15 and counting
○ Metrics systems
○ APM
○ RUM
○ Cloud platform built-in metrics
○ Log aggregation
○ Synthetics
○ Data warehouses
● Integrate via agent (self-hosted sidecar) as well as direct connection (SaaS-to-SaaS)
● Adapt to a wide variety of integration paradigms
○ API, Query, push or pull, various authentication mechanisms
● Be robust in the face of connectivity issues and operation across the internet and other networks
● Conform to various security models at large companies (for example, support web proxies BTF)
● Configuration-based telemetry
○ Integrate well with our SLOs-as-code paradigm
○ Customers apply changes using our web UI, CLI, terraform provider, k8s operator
8. Nobl9 Agent and Direct Connection Architecture
[Diagram labels: Prometheus Server and Nobl9 Agent in the Customer's Environment; Public Internet; AWS WAF and Nobl9 Intake Service in the Nobl9 / AWS Cloud; m2m Authentication; Direct.]
● Prometheus does not support authentication directly. Some users put Prom behind NGINX with HTTP Basic Auth or Client Cert Auth. N9 Agent doesn't support any authentication.
● Data source credentials are provided by the customer to the N9 Agent as environment variables. Credentials are not sent to Nobl9.
● N9 Agent executes queries against metric data sources on a defined interval using the environment credentials.
● N9 Agent polls the N9 Intake service to receive the latest configuration.
● N9 Agent pushes data to N9.
● N9 Intake service can only handle numeric float data types. (N9 Intake cannot receive or store PII.)
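The agent behavior described on this slide can be sketched as a single poll-query-push cycle. This is a minimal illustration, not Nobl9's agent code: `run_cycle`, `numeric_points`, the `DATASOURCE_TOKEN` variable name, and the shape of the config and payload dictionaries are all hypothetical. It shows two points from the slide: credentials come from the environment and never appear in the pushed payload, and only numeric values are forwarded.

```python
import math
import os
from typing import Callable, List, Mapping, Tuple

Point = Tuple[int, float]  # (unix timestamp in seconds, value)


def numeric_points(raw: List[Tuple[int, object]]) -> List[Point]:
    """Keep only finite numeric values; drop strings, None, NaN, and infinities."""
    clean: List[Point] = []
    for ts, value in raw:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        if not math.isfinite(value):
            continue
        clean.append((ts, float(value)))
    return clean


def run_cycle(
    fetch_config: Callable[[], dict],
    query_source: Callable[[str, str], list],
    push: Callable[[dict], None],
    env: Mapping[str, str] = os.environ,
) -> int:
    """One agent cycle: poll for config, query the data source, push floats."""
    config = fetch_config()  # hypothetical: latest config from the intake service
    credentials = env.get("DATASOURCE_TOKEN", "")  # hypothetical variable name
    sent = 0
    for slo in config["slos"]:
        # Credentials are used only for the local query, never for the push.
        raw = query_source(slo["query"], credentials)
        points = numeric_points(raw)
        if points:
            push({"slo": slo["name"], "points": points})
            sent += len(points)
    return sent
```

In a real agent the three callables would wrap HTTP clients and the cycle would run on the configured interval; passing them in as functions keeps the sketch testable without a network.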
11. Data Intake
Telemetry Architecture
Receiving telemetry data from a variety of sources
● Based on Telegraf
○ High quality data pipeline utility
○ Widely adopted, strong community
● Extended in-house to meet our specific requirements
○ Proprietary input and output plugins
○ Doesn't send directly to InfluxDB
○ Reports data to our Data Intake REST API
○ Dynamically reloads configuration after phoning home
● Direct connection (SaaS-to-SaaS) is special and a bit different
○ But still Telegraf is a component of it
12. Calculation of SLO time series
Calculation Requirements
Processing of telemetry data (SLIs) and math
● Calculate up-to-the-minute SLOs as data arrives
● Support a wide variety of SLO features
○ Rolling windows and Calendar-aligned windows
○ Ratio metrics as well as Threshold metrics
○ Occurrences-based calculation vs time slice-based calculation
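The difference between occurrences-based and time slice-based calculation is easiest to see on the same data. The sketch below assumes the common definitions: occurrences counts good events over all events in the window, while time slices first judges each slice good or bad against a per-slice target and then counts good slices. The function names and the `(good, total)` tuple layout are illustrative, not taken from the talk.

```python
from typing import List, Tuple

Slice = Tuple[int, int]  # (good events, total events) within one time slice


def occurrences_reliability(slices: List[Slice]) -> float:
    """Good events divided by all events across the whole window."""
    good = sum(g for g, _ in slices)
    total = sum(t for _, t in slices)
    return good / total if total else 1.0


def time_slices_reliability(slices: List[Slice], slice_target: float) -> float:
    """Fraction of non-empty slices whose own good ratio meets slice_target."""
    judged = [g / t >= slice_target for g, t in slices if t > 0]
    return sum(judged) / len(judged) if judged else 1.0


# Four slices; the third has little traffic but a poor ratio.
window = [(100, 100), (90, 100), (5, 10), (1000, 1000)]
by_occurrences = occurrences_reliability(window)        # 1195 / 1210
by_time_slices = time_slices_reliability(window, 0.95)  # 2 good slices of 4
```

A low-traffic bad slice barely moves the occurrences result but counts as a full bad slice in the time slice method, which is why the two can disagree sharply.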
13. Calculation Design
● Original version was built in InfluxDB
○ Huge prototyping win!
○ Used InfluxQL (but could have been done in Flux)
○ Queries were really intense
● Would have to scale vertically
○ Calculating SLOs repeatedly, on the fly, is intense
○ This would be a massive database
■ Calculations are memory intensive
■ Longer SLO time windows cost more
○ Add in requirements for HA, DR… a vertically scaled database solution is not ideal
When Your Architecture Requires a Vertically Scaled Database
14. Calculation of SLO time series
Calculation Architecture
Processing of telemetry data (SLIs) and math
● Rearchitected into custom code and Kafka
○ FIFO calculation approach
○ Maintains state, uses object storage as a backing store
○ Scales horizontally
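A FIFO, state-keeping calculation avoids re-querying the whole window each time a point arrives. The sketch below is one way to do that for a rolling window with ratio data, assuming points arrive in timestamp order; `RollingSLO` and its methods are hypothetical names, and Kafka and the object-storage backing store are not modeled.

```python
from collections import deque


class RollingSLO:
    """Running good/total counts over a rolling window of (ts - window, ts]."""

    def __init__(self, window_seconds: int, target: float) -> None:
        self.window = window_seconds
        self.target = target          # e.g. 0.99; assumed to be below 1.0
        self.events = deque()         # (timestamp, good, total), oldest first
        self.good = 0
        self.total = 0

    def add(self, ts: int, good: int, total: int) -> float:
        """Add one in-order data point, evict expired ones, return reliability."""
        self.events.append((ts, good, total))
        self.good += good
        self.total += total
        # First in, first out: drop points that fell out of the window.
        while self.events and self.events[0][0] <= ts - self.window:
            _, old_good, old_total = self.events.popleft()
            self.good -= old_good
            self.total -= old_total
        return self.reliability()

    def reliability(self) -> float:
        return self.good / self.total if self.total else 1.0

    def budget_remaining(self) -> float:
        """Fraction of the error budget left; negative once it is overspent."""
        allowed = (1.0 - self.target) * self.total
        bad = self.total - self.good
        return 1.0 - bad / allowed if allowed else 1.0
```

Each update costs time proportional to the number of evicted points rather than to the window length, and the state (the deque plus two counters) is small enough to snapshot to a backing store and to partition by SLO across workers.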
15. Graphs and Reports
Query Requirements
A variety of visualizations, real-time and historical
● Display up-to-the-minute data as values change (as new telemetry data arrives)
● Report over longer time scales as well -- over a year
● Allow users to seek through the data with a time window selector
● Support a multitude of SLOs running at once, and chart them
● Provide a wide variety of visualizations
○ SLO detail view
○ SLO grid view (list)
○ Various historical reports
○ Summaries such as Service Health Dashboard
20. Graphs and Reports
Query Architecture
A variety of visualizations, real-time and historical
● InfluxDB underlies all of this
○ All in Flux now
○ Flexibility is sufficient for a wide variety of creative uses
● Data granularity (resolution) is sometimes challenging in our use cases
○ We downsample data to hourly to display on longer time range graphs
○ We retain all the data in addition to the downsampled summary
○ Downsampling is done with InfluxDB Tasks
■ Requires some consideration for compatibility across InfluxDB codebases
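The talk does this downsampling with InfluxDB Tasks; the sketch below only illustrates the idea in plain Python. The choice of the mean as the hourly aggregate and the function name `downsample_hourly` are assumptions, since the slides do not say which aggregate is used.

```python
from collections import defaultdict
from statistics import mean
from typing import Iterable, List, Tuple


def downsample_hourly(points: Iterable[Tuple[int, float]]) -> List[Tuple[int, float]]:
    """Group (unix_seconds, value) points by hour and return (hour_start, mean)."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % 3600].append(value)  # truncate timestamp to the hour
    return [(hour, mean(values)) for hour, values in sorted(buckets.items())]


raw = [(0, 1.0), (1800, 3.0), (3600, 5.0), (7199, 7.0)]
hourly = downsample_hourly(raw)
```

The raw points are left untouched, matching the slide's note that all the data is retained in addition to the downsampled summary.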
21. Alerting on SLO time series
Alerting Requirements
Notify various integrations based on configuration
● Alert on configurable conditions based on SLO time series
○ Burn rate conditions
○ Error budget exhausted or partly exhausted
● Support a wide variety of alert methods and destinations
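A burn rate condition can be sketched with the usual definition: the observed error rate divided by the error rate the SLO target allows. A burn rate of 1 spends the budget exactly over the SLO window; higher values spend it faster. The function names and the threshold parameter are illustrative, not Nobl9's API.

```python
def burn_rate(good: int, total: int, target: float) -> float:
    """Observed error rate relative to the allowed error rate (1 - target).

    Assumes target is below 1.0, so the allowed error rate is non-zero.
    """
    if total == 0:
        return 0.0
    error_rate = (total - good) / total
    return error_rate / (1.0 - target)


def burn_rate_exceeded(good: int, total: int, target: float, threshold: float) -> bool:
    """Alert condition: the burn rate over the evaluated period meets the threshold."""
    return burn_rate(good, total, target) >= threshold


# 10 bad events out of 1000 against a 99.9% target burns budget about 10x too fast.
fast_burn = burn_rate(990, 1000, 0.999)
```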
22. Alerting on SLO time series
Alerting Architecture
Notify various integrations based on configuration
● Similar architecture to calculations
○ Custom Go code
○ Hanging off the same Kafka bus as calculations
● Requirements on the alert method integration side are the big driver
○ Integrate with APIs of integrated alert methods
○ Webhooks both tool-specific and a rich custom webhook
23. Data Export
Data Export Requirements
Real-time and batch exports to other tools
● Batch data export
○ Export delimited files to cloud object storage or fs
○ Import to popular data lake tooling
■ Snowflake
■ BigQuery
● Real-time data export
○ Display SLO time series alongside other metrics
○ Incorporate SLO data in existing dashboards and visualizations
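As an illustration of the batch requirement, the sketch below writes SLO time series rows as a delimited file using Python's standard `csv` module. The column names, the six-decimal formatting, and the `export_csv` function are assumptions for this example; uploading the result to cloud object storage is left out.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple

Row = Tuple[str, int, float]  # (SLO name, unix timestamp, reliability)


def export_csv(rows: Iterable[Row], out: TextIO) -> None:
    """Write a header and one comma-delimited line per SLO data point."""
    writer = csv.writer(out)
    writer.writerow(["slo", "timestamp", "reliability"])
    for slo, ts, value in rows:
        writer.writerow([slo, ts, f"{value:.6f}"])


buffer = io.StringIO()
export_csv([("checkout-latency", 1609459200, 0.9991)], buffer)
exported_lines = buffer.getvalue().splitlines()
```

Writing to any text file object means the same function can target a local file or an in-memory buffer that is then handed to an upload client.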
24. Data Export
Data Export Architecture
Real-time and batch exports to other tools
● Batch data export: custom
○ Manage authentication within and across clouds/hosting
○ Manage timing and performance of export jobs
○ Integrate with import to preferred query system
● Real-time requirements
○ Self-hosted customers
■ Use their dedicated InfluxDB
■ Use Chronograf
■ Wire InfluxDB up to something else if they want
○ SaaS customers
■ Real-time data feed in development now
■ Will support popular destinations (another InfluxDB)
25. Faceted SLOs
● Present a global SLO for a given SLI
● Provide the ability to drill down into that SLO along various dimensions
Higher Cardinality
● This data looks more like observability platform data
● Might consider a columnar database or other similar data stores
Possible InfluxDB option?
● We will be watching IOx closely to see if it could meet our needs here
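The drill-down idea can be sketched as one global ratio plus the same ratio grouped by a chosen dimension. The event layout (a dict with a `good` flag and dimension keys such as `region`) and both function names are hypothetical; the point is that every added dimension multiplies the number of series to store, which is where the cardinality concern comes from.

```python
from collections import defaultdict
from typing import Dict, List


def global_reliability(events: List[dict]) -> float:
    """Good events divided by all events, ignoring dimensions."""
    return sum(1 for e in events if e["good"]) / len(events) if events else 1.0


def faceted_reliability(events: List[dict], facet: str) -> Dict[str, float]:
    """The same ratio, computed separately for each value of one dimension."""
    counts = defaultdict(lambda: [0, 0])  # facet value -> [good, total]
    for e in events:
        bucket = counts[e[facet]]
        bucket[0] += 1 if e["good"] else 0
        bucket[1] += 1
    return {value: good / total for value, (good, total) in counts.items()}


events = [
    {"good": True, "region": "us-east", "platform": "ios"},
    {"good": False, "region": "us-east", "platform": "android"},
    {"good": True, "region": "eu-west", "platform": "ios"},
    {"good": True, "region": "eu-west", "platform": "ios"},
]
overall = global_reliability(events)
by_region = faceted_reliability(events, "region")
by_platform = faceted_reliability(events, "platform")
```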
Challenges Going Forward
[Diagram: SLI Data Faceting, High Cardinality. Facets shown: Network / ASN; Region / Data Center; Individual User; Geographic Location; Feature / User Journey; Client Platform & Version; B2C Customer; AZ / subnet.]