Inside the InfluxDB storage engine

© 2017 InfluxData. All rights reserved.
Gianluca Arbezzano
gianluca@influxdb.com
@gianarb

What is time series data?

Stock trades and quotes

Metrics

Analytics

Events

Sensor data

Two kinds of time series data…

Regular time series
t0 t1 t2 t3 t4 t6 t7
Samples at regular intervals

Irregular time series
t0 t1 t2 t3 t4 t6 t7
Events whenever they come in

Why would you want a database for time series data?

Scale

Example from server monitoring
• 2,000 servers, VMs, containers, or sensor units
• 1,000 measurements per server/unit
• every 10 seconds
• = 17,280,000,000 distinct points per day
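The arithmetic behind that figure, as a quick check. The variable names are only illustrative.

```python
servers = 2_000
measurements_per_server = 1_000
interval_seconds = 10
seconds_per_day = 24 * 60 * 60

# One sample per measurement every 10 seconds
samples_per_series_per_day = seconds_per_day // interval_seconds  # 8,640
points_per_day = servers * measurements_per_server * samples_per_series_per_day

print(f"{points_per_day:,}")  # 17,280,000,000
```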

Compression

Aging out data

Downsampling

Fast range queries

TSDB

Inverted Index

preliminary intro materials…

Everything is indexed by time and series

Shards
10/10/2015 10/11/2015 10/12/2015 10/13/2015
Data organized into shards of time, each is an underlying DB
Efficient to drop old data
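A minimal sketch of why time-based shards make dropping old data cheap: expiring data means deleting whole shards instead of individual points. It assumes one shard per day and reads the dates above as month/day/year. The dict-based "shard" and the function names are hypothetical, not InfluxDB's API.

```python
from datetime import date, timedelta

def write(shards, day, key, value):
    # Each day gets its own shard (here just a dict standing in for a DB)
    shards.setdefault(day, {})[key] = value

def drop_older_than(shards, today, keep_days):
    # Dropping old data removes whole shards, no per-point deletes
    cutoff = today - timedelta(days=keep_days)
    expired = sorted(day for day in shards if day < cutoff)
    for day in expired:
        del shards[day]
    return expired

shards = {}
for day in (10, 11, 12, 13):
    write(shards, date(2015, 10, day), ("temperature#internal", 0), 80)

dropped = drop_older_than(shards, today=date(2015, 10, 13), keep_days=2)
print(dropped, sorted(shards))
```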

InfluxDB data
temperature,device=dev1,building=b1 internal=80,external=18 1443782126

Measurement: temperature
Tags (tagset all together): device=dev1,building=b1
Fields: internal=80,external=18
Timestamp: 1443782126
We actually store up to ns scale timestamps but I couldn’t fit on the slide
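A simplified sketch of splitting the example line into measurement, tags, fields, and timestamp. It assumes no escaping, no quoted strings, and numeric fields only, so it is not a full line protocol parser. `parse_line` is a hypothetical helper.

```python
def parse_line(line):
    # "<measurement>,<tags> <fields> <timestamp>"
    key, fields, timestamp = line.split(" ")
    measurement, *tag_pairs = key.split(",")
    tags = dict(pair.split("=", 1) for pair in tag_pairs)
    field_values = {}
    for pair in fields.split(","):
        name, value = pair.split("=", 1)
        field_values[name] = float(value)
    return measurement, tags, field_values, int(timestamp)

point = parse_line(
    "temperature,device=dev1,building=b1 internal=80,external=18 1443782126"
)
print(point)
```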

Each series and field maps to a unique ID
temperature,device=dev1,building=b1#internal -> 1
temperature,device=dev1,building=b1#external -> 2

Data per ID is tuples ordered by time
1 (1443782126,80)
2 (1443782126,18)
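One way to picture the mapping: assign IDs on first sight of a series#field key, and keep each ID's points sorted by time. `SeriesStore` is a hypothetical class for illustration only.

```python
import bisect

class SeriesStore:
    def __init__(self):
        self.ids = {}     # "series#field" -> ID
        self.points = {}  # ID -> [(timestamp, value), ...] ordered by time

    def id_for(self, series_key, field):
        key = f"{series_key}#{field}"
        if key not in self.ids:
            self.ids[key] = len(self.ids) + 1
            self.points[self.ids[key]] = []
        return self.ids[key]

    def write(self, series_key, field, timestamp, value):
        # insort keeps the tuples ordered by time even for late arrivals
        bisect.insort(self.points[self.id_for(series_key, field)], (timestamp, value))

store = SeriesStore()
series = "temperature,device=dev1,building=b1"
store.write(series, "internal", 1443782127, 81)
store.write(series, "external", 1443782126, 18)
store.write(series, "internal", 1443782126, 80)
print(store.ids)
print(store.points)
```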

Arranging in Key/Value Stores
Key (ID,Time)    Value
1,1443782126     80
2,1443782126     18
1,1443782127     81 (new data)
2,1443782256     15
2,1443782130     17
3,1443700126     18
The key space is ordered
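Because the key space is ordered by (ID, time), a time-range query for one series is a contiguous slice. A small sketch using the rows above, with a sorted list standing in for the key/value store. `range_scan` is a hypothetical helper.

```python
import bisect

rows = [
    ((1, 1443782126), 80),
    ((2, 1443782126), 18),
    ((1, 1443782127), 81),
    ((2, 1443782256), 15),
    ((2, 1443782130), 17),
    ((3, 1443700126), 18),
]

table = sorted(rows)            # ordered by (ID, time), as the store would keep it
keys = [key for key, _ in table]

def range_scan(series_id, start, end):
    # Two binary searches bound the slice; no other series is touched
    lo = bisect.bisect_left(keys, (series_id, start))
    hi = bisect.bisect_right(keys, (series_id, end))
    return table[lo:hi]

result = range_scan(2, 1443782126, 1443782200)
print(result)
```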

Many existing storage engines have this model

New Storage Engine?!

First we used LSM Trees

deletes expensive

too many open file handles

Then mmap COW B+Trees

write throughput

compression

met our requirements

High write throughput

Awesome read performance

Better Compression

Writes can’t block reads

Reads can’t block writes

Write multiple ranges simultaneously

Many databases open in a single process

Enter InfluxDB’s Time Structured Merge Tree (TSM Tree)
like LSM, but different

Components (similar to LSM Trees)
WAL: same
In memory cache: like MemTables
Index Files: like SSTables

awesome time series data
WAL (an append only file)
in memory index
on disk index (periodic flushes), memory mapped!
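A toy sketch of that write path: append to the WAL first, then update the in-memory structure, and periodically flush a sorted snapshot. The JSON-lines WAL format, the `Engine` class, and the in-memory list of "files" are all assumptions made for illustration. The real engine writes TSM files to disk and memory-maps them.

```python
import json
import os
import tempfile

class Engine:
    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.cache = {}   # key -> [(timestamp, value), ...]
        self.files = []   # flushed, immutable snapshots
        if os.path.exists(wal_path):
            # Recovery: replay the WAL to rebuild the in-memory state
            with open(wal_path) as wal:
                for line in wal:
                    self._apply(*json.loads(line))

    def _apply(self, key, timestamp, value):
        self.cache.setdefault(key, []).append((timestamp, value))

    def write(self, key, timestamp, value):
        with open(self.wal_path, "a") as wal:  # append only
            wal.write(json.dumps([key, timestamp, value]) + "\n")
        self._apply(key, timestamp, value)

    def flush(self):
        # Periodic flush: sorted snapshot out, cache and WAL reset
        snapshot = {key: sorted(points) for key, points in sorted(self.cache.items())}
        self.files.append(snapshot)
        self.cache = {}
        open(self.wal_path, "w").close()

wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
engine = Engine(wal_path)
engine.write("cpu,host=A#idle", 1443782127, 81)
engine.write("cpu,host=A#idle", 1443782126, 80)

restarted = Engine(wal_path)  # simulated restart, state comes back from the WAL
restarted.flush()
print(restarted.files)
```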

TSM File

Compression

Timestamps: encoding based on precision and deltas

Timestamps (best case):
Run length encoding
Deltas are all the same for a block

Timestamps (good case):
Simple8B
Anh and Moffat in "Index compression using 64-bit words"

Timestamps (worst case):
raw values
nanosecond timestamps with large deltas
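A rough sketch of the best case: delta-encode a block and, if every delta is identical, store it as a single run. It assumes a non-empty block and skips both the precision scaling and the Simple8B packing, so the fallback here just keeps the plain deltas. The function names and tuple layout are hypothetical.

```python
def encode_timestamps(timestamps):
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if deltas and all(d == deltas[0] for d in deltas):
        # Best case: (first timestamp, delta, repeat count)
        return ("rle", timestamps[0], deltas[0], len(deltas))
    # Otherwise keep the deltas (candidates for Simple8B, or raw values)
    return ("deltas", timestamps[0], deltas)

def decode_timestamps(encoded):
    if encoded[0] == "rle":
        _, first, delta, count = encoded
        return [first + i * delta for i in range(count + 1)]
    _, first, deltas = encoded
    out = [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out

regular = [1443782120 + 10 * i for i in range(1000)]
irregular = [1443782126, 1443782127, 1443782130, 1443782256]

print(encode_timestamps(regular))    # ('rle', 1443782120, 10, 999)
print(encode_timestamps(irregular))  # ('deltas', 1443782126, [1, 3, 126])
```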

float64: XOR against the previous value (Gorilla)
Facebook’s Gorilla - google: gorilla time series facebook
https://github.com/dgryski/go-tsz
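The Gorilla paper compresses float values by XORing each one with the previous value: repeated values XOR to zero and nearby values leave only a few set bits. The sketch below shows just that XOR step, not the bit-packing of leading and trailing zeros, and the helper names are hypothetical.

```python
import struct

def float_bits(x):
    # Reinterpret a float64 as its 64-bit IEEE 754 pattern
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def bits_float(b):
    return struct.unpack(">d", struct.pack(">Q", b))[0]

def xor_encode(values):
    prev, out = 0, []
    for v in values:
        bits = float_bits(v)
        out.append(bits ^ prev)  # first entry is the raw bit pattern
        prev = bits
    return out

def xor_decode(xors):
    prev, out = 0, []
    for x in xors:
        prev ^= x
        out.append(bits_float(prev))
    return out

xors = xor_encode([80.0, 80.0, 81.0])
print([f"{x:016x}" for x in xors])
```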

booleans are bits!
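A small sketch of packing booleans one bit each. The most-significant-bit-first order and the separate count are assumptions for illustration, not the on-disk format.

```python
def pack_bools(values):
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 0x80 >> (i % 8)  # most significant bit first
    return len(values), bytes(out)

def unpack_bools(count, data):
    return [bool(data[i // 8] & (0x80 >> (i % 8))) for i in range(count)]

flags = [True, False, True, True, False, False, False, False, True]
count, packed = pack_bools(flags)
print(count, packed.hex())  # 9 values fit in 2 bytes
```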

int64 uses double delta, zig-zag
zig-zag same as from Protobufs
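Zig-zag encoding maps signed integers to unsigned ones so that small magnitudes, positive or negative, become small numbers. A sketch for 64-bit values, applied here to plain deltas for illustration:

```python
def zigzag_encode(n):
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ... (n must fit in int64)
    return (n << 1) ^ (n >> 63)

def zigzag_decode(z):
    return (z >> 1) ^ -(z & 1)

values = [80, 81, 79, 79]
deltas = [b - a for a, b in zip(values, values[1:])]  # [1, -2, 0]
encoded = [zigzag_encode(d) for d in deltas]          # [2, 3, 0]
print(deltas, encoded)
```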

string uses Snappy
same compression LevelDB uses
(might add dictionary compression)

Updates
Write, resolve at query

Deletes
tombstone, resolve at query & compaction
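A sketch of resolving both at read time: a later write to the same key and timestamp wins, and points covered by a tombstone are filtered out. Modeling a tombstone as (key, min time, max time) is an assumption for this example, and `resolve` is a hypothetical helper.

```python
def resolve(writes, tombstones):
    latest = {}
    for key, timestamp, value in writes:
        latest[(key, timestamp)] = value  # later writes overwrite earlier ones

    def deleted(key, timestamp):
        return any(k == key and lo <= timestamp <= hi for k, lo, hi in tombstones)

    return sorted(
        (key, timestamp, value)
        for (key, timestamp), value in latest.items()
        if not deleted(key, timestamp)
    )

writes = [
    ("cpu#idle", 1443782126, 80),
    ("cpu#idle", 1443782127, 81),
    ("cpu#idle", 1443782126, 85),  # update of the first point
    ("cpu#idle", 1443782130, 17),
]
tombstones = [("cpu#idle", 1443782127, 1443782129)]

visible = resolve(writes, tombstones)
print(visible)
```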

Compactions
• Combine multiple TSM files
• Put all series points into same file
• Series points in 1k blocks
• Multiple levels
• Full compaction when cold for writes
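A simplified sketch of the first three bullets: merge several sorted files so each series' points end up contiguous, then cut them into blocks of at most 1,000 points. In-memory lists stand in for TSM files, duplicates and tombstones are ignored, and `compact` is a hypothetical function.

```python
import heapq
from itertools import groupby

BLOCK_SIZE = 1000

def compact(*files):
    # Each input is a list of (key, timestamp, value) sorted by key, then time
    merged = heapq.merge(*files)
    blocks = []
    for key, points in groupby(merged, key=lambda p: p[0]):
        points = list(points)
        for i in range(0, len(points), BLOCK_SIZE):
            chunk = points[i:i + BLOCK_SIZE]
            # Block: (key, min time, max time, [(timestamp, value), ...])
            blocks.append((key, chunk[0][1], chunk[-1][1],
                           [(ts, v) for _, ts, v in chunk]))
    return blocks

file_a = [("cpu", t, 1.0) for t in range(0, 1500, 2)]
file_b = [("cpu", t, 2.0) for t in range(1, 1500, 2)] + [("mem", 5, 3.0)]

blocks = compact(file_a, file_b)
print([(key, lo, hi, len(pts)) for key, lo, hi, pts in blocks])
```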

Example Query
select percentile(90, value) from cpu
where time > now() - 12h and "region" = 'west'
group by time(10m), host
How to map to series?

Inverted Index!

Inverted Index
series to ID:
cpu,host=A,region=west#idle -> 1
cpu,host=B,region=west#idle -> 2
measurement to fields:
cpu -> [idle]
host to values:
host -> [A, B]
region to values:
region -> [west]
postings lists:
cpu -> [1, 2]
host=A -> [1]
host=B -> [2]
region=west -> [1, 2]
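A sketch of building these maps from the two example series and using the postings lists to answer the query's `"region" = 'west'` filter on `cpu`. The key parsing is simplified (no escaping) and `build_index` is a hypothetical helper.

```python
from collections import defaultdict

def build_index(series_keys):
    series_ids = {}
    fields = defaultdict(set)      # measurement -> fields
    tag_values = defaultdict(set)  # tag key -> values
    postings = defaultdict(set)    # measurement or tag=value -> series IDs
    for series_id, key in enumerate(series_keys, start=1):
        series_ids[key] = series_id
        name_and_tags, field = key.split("#")
        measurement, *pairs = name_and_tags.split(",")
        fields[measurement].add(field)
        postings[measurement].add(series_id)
        for pair in pairs:
            tag, value = pair.split("=")
            tag_values[tag].add(value)
            postings[pair].add(series_id)
    return series_ids, fields, tag_values, postings

series_ids, fields, tag_values, postings = build_index([
    "cpu,host=A,region=west#idle",
    "cpu,host=B,region=west#idle",
])

# from cpu where "region" = 'west': intersect the postings lists
matches = postings["cpu"] & postings["region=west"]
# group by host: split the matches per host value
by_host = {h: sorted(matches & postings[f"host={h}"]) for h in sorted(tag_values["host"])}
print(sorted(matches), by_host)
```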

Index V1
• In-memory
• Load on boot
• Memory constrained
• Slower boot times with high cardinality

Index V2

in memory index, on disk index (do we already have?)
time series meta data
nope
on disk indices (periodic flushes)
(compactions)
on disk index

Index File Layout

Example Key Exists Lookup
[ 76, 234, 129, 352 ] File locations
cpu,host=serverA,region=west#idle

Robin Hood Hashing
• Can fully load table
• No linked lists for lookup
• Perfect for read-only hashes

Positions:     [ 0, 1, 2, 3, 4 ]

Empty table
Keys:          [  ,  ,  ,  ,   ]
Probe lengths: [ 0, 0, 0, 0, 0 ]

A -> hashes to 0
Keys:          [ A,  ,  ,  ,   ]
Probe lengths: [ 0, 0, 0, 0, 0 ]

B -> hashes to 1
Keys:          [ A, B,  ,  ,   ]
Probe lengths: [ 0, 0, 0, 0, 0 ]

C -> hashes to 1 (taken by B), moves to 2, probe 1
Keys:          [ A, B, C,  ,   ]
Probe lengths: [ 0, 0, 1, 0, 0 ]

D -> hashes to 0 (taken by A), moves to 1 with probe 1 and displaces B (probe 0)
Keys:          [ A, D, C,  ,   ]
Probe lengths: [ 0, 1, 1, 0, 0 ]

B -> probe 1 at position 2 (taken by C), probe 2 at position 3
Keys:          [ A, D, C, B,   ]
Probe lengths: [ 0, 1, 1, 2, 0 ]

Rob probe rich, give to probe poor
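A compact sketch of Robin Hood insertion that reproduces the A, B, C, D walkthrough. The `home` dict stands in for a hash function and is a made-up input chosen to match the slides (A and D hash to 0, B and C hash to 1).

```python
def insert(keys, probes, key, home):
    size = len(keys)
    pos, probe = home[key] % size, 0
    for _ in range(size):
        if keys[pos] is None:
            keys[pos], probes[pos] = key, probe
            return
        if probes[pos] < probe:
            # Resident is "richer" (closer to its home slot): take its place
            # and carry on inserting the displaced key instead
            keys[pos], key = key, keys[pos]
            probes[pos], probe = probe, probes[pos]
        pos = (pos + 1) % size
        probe += 1
    raise RuntimeError("table is full")

home = {"A": 0, "B": 1, "C": 1, "D": 0}  # hypothetical hash results
keys = [None] * 5
probes = [0] * 5
for key in "ABCD":
    insert(keys, probes, key, home)

print(keys)    # ['A', 'D', 'C', 'B', None]
print(probes)  # [0, 1, 1, 2, 0]
```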

Refinement: average probe

Cache Hit
Keys:          [ A, D, C, B,   ]
Positions:     [ 0, 1, 2, 3, 4 ]
Probe lengths: [ 0, 1, 1, 2, 0 ]
Average probe: 1
D -> hashes to 0 + 1

Cache Miss
Keys:          [ A, D, C, B,   ]
Positions:     [ 0, 1, 2, 3, 4 ]
Probe lengths: [ 0, 1, 1, 2, 0 ]
Z -> hashes to 0
Z -> move probe 1
Z -> move probe 2
Max probe 2, so Z not present
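A sketch of lookups against the finished table. It starts at the home slot plus the average probe length, fans out from there, and gives up after the maximum probe length. The exact search order is one possible reading of the refinement, and the `home` dict again stands in for a hash function.

```python
keys = ["A", "D", "C", "B", None]
probes = [0, 1, 1, 2, 0]
home = {"A": 0, "B": 1, "C": 1, "D": 0, "Z": 0}  # hypothetical hash results

occupied = [p for k, p in zip(keys, probes) if k is not None]
average_probe = round(sum(occupied) / len(occupied))  # 1
max_probe = max(occupied)                             # 2

def lookup(key):
    # Try offsets closest to the average probe first, never beyond max_probe
    offsets = sorted(range(max_probe + 1), key=lambda o: abs(o - average_probe))
    for checked, offset in enumerate(offsets, start=1):
        pos = (home[key] + offset) % len(keys)
        if keys[pos] == key:
            return pos, checked
    return None, len(offsets)

print(lookup("D"))  # found at position 1 on the first check
print(lookup("Z"))  # not present after max_probe + 1 checks
```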

Cardinality Estimation

HyperLogLog++
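To give a feel for the idea, here is a sketch of plain HyperLogLog, not the ++ variant: it leaves out the bias correction and sparse representation and has no small-range correction, so it is only reasonable for counts well above the register count. The class, the SHA-1 based hash, and the precision of 10 are illustrative choices.

```python
import hashlib

class HyperLogLog:
    def __init__(self, precision=10):
        self.p = precision
        self.m = 1 << precision  # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        index = h >> (64 - self.p)             # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining bits
        # Rank: position of the leftmost 1 bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # valid for m >= 128
        return alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)

hll = HyperLogLog()
for i in range(10_000):
    hll.add(f"cpu,host=server{i}")
    hll.add(f"cpu,host=server{i}")  # duplicates do not change the estimate

estimate = hll.estimate()
print(round(estimate))
```

With 1,024 registers the expected relative error is around 3%, at the cost of roughly a kilobyte of state regardless of how many series are counted.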

Gianluca Arbezzano
gianluca@influxdb.com
@gianarb
Thank you.
