InfluxData builds a time series platform primarily deployed for DevOps and IoT monitoring. This talk presents several lessons learned while scaling the platform across a large number of deployments, from single-server open source instances to highly available, high-throughput clusters.
This talk presents a number of failure conditions that informed subsequent design choices. Ryan Betts (Director of Engineering at InfluxData) will discuss designing backpressure in an AP system with tens of thousands of resource-limited writers; trade-offs between monolithic and service-oriented database implementations; and lessons learned implementing multiple query processing systems.
13. Agenda
• 2.0 key differences
• Path to 2.0 OSS beta
• 2019 internals focus
• 1.x release and sustaining model
14. Path to 2.0 OSS Beta
• Weekly Alpha releases adding new functionality + testing
• Features
• InfluxQL Transpiler
• DELETE with predicate
• Bulk Import (1.x, 2.x)
• Bulk Export
• Community process (issue templates, GitHub labels, milestone communication…)
15. Flux Release Train
• Weekly releases
• Deployed to Cloud2
• Weekly with 2.0 OSS Alpha
• Monthly with 1.7.x InfluxDB
• https://github.com/influxdata/flux/releases
16. Agenda
• 2.0 key differences
• Path to 2.0 OSS beta
• 2019 internals focus
• 1.x release and sustaining model
17. 2019 Internals Points of Emphasis
• Community responsiveness
• DELETE correctness
• Load-shedding and back-pressure
• Query resource limits
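A minimal sketch of the load-shedding and back-pressure idea, not InfluxDB's implementation: a bounded write buffer that first asks writers to slow down at a soft limit and rejects writes outright at a hard limit. The class name, the watermark parameters, and the return strings are all hypothetical and chosen for illustration only.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class BoundedWriteQueue:
    """Toy write buffer: signals back-pressure first, then sheds load."""
    capacity: int        # hard limit: at this depth, new writes are rejected
    high_watermark: int  # soft limit: above this depth, writers are asked to back off
    _items: deque = field(default_factory=deque)
    shed: int = 0        # number of writes rejected so far

    def offer(self, batch: str) -> str:
        """Return 'ok', 'slow_down' (accepted, please back off), or 'rejected'."""
        if len(self._items) >= self.capacity:
            self.shed += 1           # load shedding: drop rather than queue unboundedly
            return "rejected"
        self._items.append(batch)
        if len(self._items) > self.high_watermark:
            return "slow_down"       # back-pressure signal to the writer
        return "ok"

    def drain(self, n: int) -> list:
        """Simulate the storage engine consuming up to n batches."""
        out = []
        while self._items and len(out) < n:
            out.append(self._items.popleft())
        return out


queue = BoundedWriteQueue(capacity=4, high_watermark=2)
results = [queue.offer(f"batch-{i}") for i in range(6)]
print(results)
```

With tens of thousands of resource-limited writers, the useful property is that the server's memory use stays bounded and writers get an explicit signal instead of a timeout.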
18. TSM
Observations
• Write amplification rarely a concern
• Compaction memory & CPU utilization often a concern
• Backfilling is common - as a special case of bulk load
• Range deletes with a predicate are common
• Offline tooling is surprisingly popular
• TSM space efficiency can be very variable
21. Agenda
• 2.0 key differences
• Path to 2.0 OSS beta
• 2019 internals focus
• 1.x release and sustaining model
22. 2019 Release Train for 1.x
• Monthly InfluxDB releases for 1.7, 1.6, 1.5 (on demand)
• Chronograf releases paired with InfluxDB 1.7 (for Flux)
• Kapacitor released as necessary
• 1.8 InfluxDB release as vehicle for Flux GA
Want to highlight not just the product and the time-to-awesome aspects of this, but also the operational and community impacts.
UI and DB run in the same process space. If you monitor InfluxDB internals, you’ll see UX, Tasks, Scrapers, outbound I/O, etc.
Consolidates not just the products into a single binary but also consolidates our Chronograf, Kapacitor, InfluxDB GitHub activity into the `influxdb` repo.
We have some new practices to learn / develop in support now that these previously separate components run in a single process space.
Few people like the continuous query experience:
* too hard to write / try / observe
* too hard to debug
* too hard to re-run on failure
* too hard to report run results
OSS, Enterprise storage nodes, Enterprise meta nodes, … all different interfaces. Now consolidated.
OSS, Enterprise 1.x have different permission systems. Some changes require the ./influx CLI; some are available via REST. None of those REST endpoints are well documented (a criticism of development - not our docs team!)
Shipped FGA to “fake” some separation of tenants — but it adds complexity that’s not worth its weight.
Retention v. Series creation.
Database physical arrangement vs. bucket logical arrangement.
- designed to support powerful language services to drive graphical user experiences that can't be created at a reasonable engineering expense otherwise. We want to remove the programming from time series and present a better UX
- designed to integrate across disparate datasources (a key argument here is that SQL systems are not compatible with each other and integration code is necessary anyway -- so why not make integration a first-class language design feature)
- ability to cross-compile; time series is all about ecosystem, and we want to embrace Prometheus, SQL, Flux, Arrow, ... all on a single optimizer and access-path engine. So we wrote the underlying language tools to do that.
- in addition to integrating other syntaxes, we want to integrate with other programming and analytics environments -- like Jupyter notebooks. We want to integrate with those using arrow - for efficiency, scale, and interoperability - and using the query syntax the user wants.
- we need a 4GL data scripting language for alerting and other time series use cases. We don't want PL/SQL. Barf. Other SQL systems also reach for non-SQL languages when they need this in stored procedure engines (JavaScript, Lua, Java, ...)
Personal Background: 20 years building infrastructure software as an engineer, technical lead, CTO, and director of engineering.
Have worked at two DB companies maturing 1.0 products
Talk is about a few things: scaling some of the softer side and also some specific InfluxDB internals
Community: ACES, DevRel + client code, Community Manager, InfluxDB triage with PM
Delete: DML not Query code path
TLA Model
Always the victim
Complicates load shedding — more critical
ADDS I/O in a reduced functional state
Exacerbates “thundering herd” on recovery
Is only a partial solution anyway
In our case, is much less dense than compacted blocks; not efficient
Example of doing more work: Telegraf replaying timed-out data leads to more merging
Also a design flaw in current InfluxDB: deletes are processed by the query engine, not the write path, and don’t flow through HH, so HH can be a cause of inconsistency.
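A toy model of how that inconsistency can arise, assuming a coordinator that queues writes for a down replica as hints but applies deletes only to replicas that are currently up. The `Replica` and `Coordinator` classes and their methods are hypothetical illustrations, not InfluxDB code.

```python
class Replica:
    """Toy replica holding a set of point keys."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.points = set()


class Coordinator:
    """Toy coordinator: writes to a down replica are hinted, deletes are not."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.hints = {r.name: [] for r in replicas}

    def write(self, key):
        for r in self.replicas:
            if r.up:
                r.points.add(key)
            else:
                self.hints[r.name].append(key)  # hinted handoff for later replay

    def delete(self, key):
        # Assumed flaw: the delete bypasses the hint queue entirely.
        for r in self.replicas:
            if r.up:
                r.points.discard(key)

    def replay_hints(self, replica):
        # Replays queued writes once the replica is back.
        for key in self.hints[replica.name]:
            replica.points.add(key)
        self.hints[replica.name].clear()


a, b = Replica("a"), Replica("b")
coord = Coordinator([a, b])

b.up = False
coord.write("cpu,host=h1 t=1")   # applied to a, hinted for b
coord.delete("cpu,host=h1 t=1")  # applied to a only
b.up = True
coord.replay_hints(b)            # b now holds a point that a has deleted

print(a.points, b.points)
```

In this model the replicas diverge after recovery: the replayed hint resurrects the deleted point on one replica only. Routing deletes through the same path as writes would keep the two operations ordered relative to each other.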