Title: Dive into ClickHouse storage system
Speaker: Alexander Sapin (https://github.com/alesapin/)
Date: Thursday, March 5, 2020
Event: https://meetup.com/Athens-Big-Data/events/268379195/
5. MergeTree Engines Family
Advantages:
âș Inserts are atomic
âș Selects are blazing fast
âș Primary and secondary indexes
âș Data modification!
âș Inserts and selects donât block each other
2 / 84
6. MergeTree Engines Family
Advantages:
âș Inserts are atomic
âș Selects are blazing fast
âș Primary and secondary indexes
âș Data modification!
âș Inserts and selects donât block each other
Features:
âș Infrequent INSERTs required
3 / 84
7. MergeTree Engines Family
Advantages:
âș Inserts are atomic
âș Selects are blazing fast
âș Primary and secondary indexes
âș Data modification!
âș Inserts and selects donât block each other
Features:
âș Infrequent INSERTs required (work in progress)
4 / 84
8. MergeTree Engines Family
Advantages:
âș Inserts are atomic
âș Selects are blazing fast
âș Primary and secondary indexes
âș Data modification!
âș Inserts and selects donât block each other
Features:
âș Infrequent INSERTs required (work in progress)
âș Background operations on records with the same keys
âș Primary key is NOT unique
5 / 84
10. Create Table and Fill Some Data
Create Table:
CREATE TABLE mt (
EventDate Date,
OrderID Int32,
BannerID UInt64,
GoalNum Int8
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate) ORDER BY (OrderID, BannerID)
7 / 84
11. Create Table and Fill Some Data
Create Table:
CREATE TABLE mt (
EventDate Date,
OrderID Int32,
BannerID UInt64,
GoalNum Int8
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate) ORDER BY (OrderID, BannerID)
Fill Data (twice):
INSERT INTO mt SELECT toDate('2018-09-26'),
number, number + 10000, number % 128 from numbers(1000000);
INSERT INTO mt SELECT toDate('2018-10-15'),
number, number + 10000, number % 128 from numbers(1000000, 1000000);
8 / 84
13. Table on Disk
metadata:
$ ls /var/lib/clickhouse/metadata/default/
mt.sql
data:
$ ls /var/lib/clickhouse/data/default/mt
201809_2_2_0 201809_3_3_0 201810_1_1_0 201810_4_4_0 201810_1_4_1
detached format_version.txt
10 / 84
14. Table on Disk
metadata:
$ ls /var/lib/clickhouse/metadata/default/
mt.sql
data:
$ ls /var/lib/clickhouse/data/default/mt
201809_2_2_0 201809_3_3_0 201810_1_1_0 201810_4_4_0 201810_1_4_1
detached format_version.txt
Contents:
âș Format file format_version.txt
âș Directories with parts
âș Directory for detached parts
11 / 84
15. Parts: Details
âș Part of the data in PK order
âș Contain interval of insert numbers
âș Created for each INSERT
âș Cannot be changed (immutable)
12 / 84
16. Parts: Why?
PK ordering is needed for efficient OLAP queries
âș In our case (OrderID, BannerID)
13 / 84
17. Parts: Why?
PK ordering is needed for efficient OLAP queries
âș In our case (OrderID, BannerID)
But the data comes in order of time
âș By EventDate
14 / 84
18. Parts: Why?
PK ordering is needed for efficient OLAP queries
âș In our case (OrderID, BannerID)
But the data comes in order of time
âș By EventDate
Re-sorting all the data is expensive
âș ClickHouse handle hundreds of terabytes
15 / 84
19. Parts: Why?
PK ordering is needed for efficient OLAP queries
âș In our case (OrderID, BannerID)
But the data comes in order of time
âș By EventDate
Re-sorting all the data is expensive
âș ClickHouse handle hundreds of terabytes
Solution: Store data in a set of ordered parts!
16 / 84
27. Parts: Data
$ ls /var/lib/clickhouse/data/default/mt/201810_1_4_1
GoalNum.bin GoalNum.mrk BannerID.bin ...
primary.idx checksums.txt count.txt columns.txt
partition.dat minmax_EventDate.idx
24 / 84
28. Parts: Data
$ ls /var/lib/clickhouse/data/default/mt/201810_1_4_1
GoalNum.bin GoalNum.mrk BannerID.bin ...
primary.idx checksums.txt count.txt columns.txt
partition.dat minmax_EventDate.idx
Contents:
âș primary.idx â primary key on disk
25 / 84
29. Parts: Data
$ ls /var/lib/clickhouse/data/default/mt/201810_1_4_1
GoalNum.bin GoalNum.mrk BannerID.bin ...
primary.idx checksums.txt count.txt columns.txt
partition.dat minmax_EventDate.idx
Contents:
âș primary.idx â primary key on disk
âș GoalNum.bin â compressed column
âș GoalNum.mrk â marks for column
26 / 84
30. Parts: Data
$ ls /var/lib/clickhouse/data/default/mt/201810_1_4_1
GoalNum.bin GoalNum.mrk BannerID.bin ...
primary.idx checksums.txt count.txt columns.txt
partition.dat minmax_EventDate.idx
Contents:
âș primary.idx â primary key on disk
âș GoalNum.bin â compressed column
âș GoalNum.mrk â marks for column
âș partition.dat â partition id
27 / 84
31. Parts: Data
$ ls /var/lib/clickhouse/data/default/mt/201810_1_4_1
GoalNum.bin GoalNum.mrk BannerID.bin ...
primary.idx checksums.txt count.txt columns.txt
partition.dat minmax_EventDate.idx
Contents:
âș primary.idx â primary key on disk
âș GoalNum.bin â compressed column
âș GoalNum.mrk â marks for column
âș partition.dat â partition id
âș ... â a lot of other useful files
28 / 84
33. Columns
âș Each column in separate file
âș Compressed by blocks
âș Checksums for each block
OrderID.bin
0
65795
131614
197433
Uncompressed
block
14324
17214
17215
17216
17217
Checksums
30 / 84
34. How to Use Index?
Problem:
âș Index is sparse and contain rows
âș Columns contain compressed blocks
How to match index with columns?
31 / 84
35. Solution: Marks
âș Mark â offset in compressed file and uncompressed block
âș Stored in column_name.mrk files
âș One for each index row
32 / 84
36. Solution: Marks
âș Mark â offset in compressed file and uncompressed block
âș Stored in column_name.mrk files
âș One for each index row
65795 0
OrderID.mrk
OrderID.bin
0
65795
131614
197433
0
32768
Uncompressed
block
8192
65795 32768
33 / 84
37. Put it all together
Algorithm:
âș Determine required index rows
âș Found corresponding marks
âș Distribute granules (stripe of marks) among threads
âș Read required granules
Properties:
âș Read granules concurrently
âș Threads can steal tasks
34 / 84
38. Read Data from Disk
0 10000
8192
16384
18192
26384
1998848 2008848
EventDate OrderID GoalNum
2b 4b 1b
primary.idx
KeyCondition
.mrk .bin .mrk .bin .mrk .bin
thread 1 thread 2
OrderID BannerID
SELECT any(EventDate),
max(GoalNum) FROM mt
WHERE OrderID BETWEEN
6123 AND 17345
35 / 84
39. Read Data from Disk
0 10000
8192
16384
18192
26384
1998848 2008848
EventDate OrderID GoalNum
2b 4b 1b
primary.idx
KeyCondition
.mrk .bin .mrk .bin .mrk .bin
thread 1 thread 2
OrderID BannerID
SELECT any(EventDate),
max(GoalNum) FROM mt
WHERE OrderID BETWEEN
6123 AND 17345
36 / 84
40. Read Data from Disk
0 10000
8192
16384
18192
26384
1998848 2008848
EventDate OrderID GoalNum
2b 4b 1b
primary.idx
KeyCondition
.mrk .bin .mrk .bin .mrk .bin
thread 1 thread 2
OrderID BannerID
SELECT any(EventDate),
max(GoalNum) FROM mt
WHERE OrderID BETWEEN
6123 AND 17345
37 / 84
41. Read Data from Disk
0 10000
8192
16384
18192
26384
1998848 2008848
EventDate OrderID GoalNum
2b 4b 1b
primary.idx
KeyCondition
.mrk .bin .mrk .bin .mrk .bin
thread 1 thread 2
OrderID BannerID
SELECT any(EventDate),
max(GoalNum) FROM mt
WHERE OrderID BETWEEN
6123 AND 17345
38 / 84
42. Read Data from Disk
0 10000
8192
16384
18192
26384
1998848 2008848
EventDate OrderID GoalNum
2b 4b 1b
primary.idx
KeyCondition
.mrk .bin .mrk .bin .mrk .bin
thread 1 thread 2
OrderID BannerID
SELECT any(EventDate),
max(GoalNum) FROM mt
WHERE OrderID BETWEEN
6123 AND 17345
39 / 84
43. Read Data from Disk
0 10000
8192
16384
18192
26384
1998848 2008848
EventDate OrderID GoalNum
2b 4b 1b
primary.idx
KeyCondition
.mrk .bin .mrk .bin .mrk .bin
thread 1 thread 2
OrderID BannerID
SELECT any(EventDate),
max(GoalNum) FROM mt
WHERE OrderID BETWEEN
6123 AND 17345
40 / 84
44. Read Data from Disk
0 10000
8192
16384
18192
26384
1998848 2008848
EventDate OrderID GoalNum
2b 4b 1b
primary.idx
KeyCondition
.mrk .bin .mrk .bin .mrk .bin
thread 1 thread 2
OrderID BannerID
SELECT any(EventDate),
max(GoalNum) FROM mt
WHERE OrderID BETWEEN
6123 AND 17345
41 / 84
45. Read Data from Disk
0 10000
8192
16384
18192
26384
1998848 2008848
EventDate OrderID GoalNum
2b 4b 1b
primary.idx
KeyCondition
.mrk .bin .mrk .bin .mrk .bin
thread 1 thread 2
OrderID BannerID
SELECT any(EventDate),
max(GoalNum) FROM mt
WHERE OrderID BETWEEN
6123 AND 17345
42 / 84
46. Read Data from Disk
0 10000
8192
16384
18192
26384
1998848 2008848
EventDate OrderID GoalNum
2b 4b 1b
primary.idx
KeyCondition
.mrk .bin .mrk .bin .mrk .bin
thread 1 thread 2
OrderID BannerID
SELECT any(EventDate),
max(GoalNum) FROM mt
WHERE OrderID BETWEEN
6123 AND 17345
43 / 84
47. Read Data from Disk
0 10000
8192
16384
18192
26384
1998848 2008848
EventDate OrderID GoalNum
2b 4b 1b
primary.idx
KeyCondition
.mrk .bin .mrk .bin .mrk .bin
thread 1 thread 2
OrderID BannerID
SELECT any(EventDate),
max(GoalNum) FROM mt
WHERE OrderID BETWEEN
6123 AND 17345
44 / 84
48. Read Data from Disk
0 10000
8192
16384
18192
26384
1998848 2008848
EventDate OrderID GoalNum
2b 4b 1b
primary.idx
KeyCondition
.mrk .bin .mrk .bin .mrk .bin
thread 1 thread 2
OrderID BannerID
SELECT any(EventDate),
max(GoalNum) FROM mt
WHERE OrderID BETWEEN
6123 AND 17345
45 / 84
50. Problem: Amount Files in Parts
Test example: Almost OK
$ ls /var/lib/clickhouse/data/default/mt/201810_1_4_1 | wc -l
14
47 / 84
51. Problem: Amount Files in Parts
Test example: Almost OK
$ ls /var/lib/clickhouse/data/default/mt/201810_1_4_1 | wc -l
14
Production: Too much
$ ssh -A root@mtgiga075-2.metrika.yandex.net
# ls /var/lib/clickhouse/data/merge/visits_v2/202002_4462_4462_0 | wc -l
1556
48 / 84
52. Solution: Merges
M N N+1
Primary
key
Insert number
Part
[M,N]
Part
[N+1]
49 / 84
53. Solution: Merges
M N N+1
Background merge
[M,N] [N+1]
Primary
key
Part Part
Insert number
50 / 84
55. Properties of Merge
âș Each part participate in a single successful merge
âș Source parts became inactive
âș Addional logic during merge
52 / 84
56. Things to do while merging
Replace/update records
âș ReplacingMergeTree â replace
âș SummingMergeTree â sum
âș CollapsingMergeTree â fold
âș VersionedCollapsingMergeTree â fold rows + versioning
Pre-aggregate data
âș AggregatingMergeTree â merge aggregate function states
Metrics rollup
âș GraphiteMergeTree â rollup in graphite fashion
53 / 84
59. Partitioning
ENGINE = MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY ...
âș Logical entities (doesnât stored on disk)
âș Table can be partitioned by any expression (default: by month)
âș Parts from different partitions never merged
âș MinMax index by partition columns
âș Easy manipulation of partitions:
56 / 84
60. Partitioning
ENGINE = MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY ...
âș Logical entities (doesnât stored on disk)
âș Table can be partitioned by any expression (default: by month)
âș Parts from different partitions never merged
âș MinMax index by partition columns
âș Easy manipulation of partitions:
ALTER TABLE mt DROP PARTITION 201810
ALTER TABLE mt DETACH/ATTACH PARTITION 201809
57 / 84
61. Mutations
ALTER TABLE mt DELETE WHERE OrderID < 1205
ALTER TABLE mt UPDATE GoalNum = 3 WHERE BannerID = 235433;
Features
âș NOT designed for regular usage
âș Overwrite all touched parts on disk
âș Work in background
âș Original parts became inactive
58 / 84
84. Things to remember
Control total number of parts
âș Rate of INSERTs
Merging runs in the background
âș Even when there are no queries!
âș With additional fold logic
81 / 84
85. Things to remember
Control total number of parts
âș Rate of INSERTs
Merging runs in the background
âș Even when there are no queries!
âș With additional fold logic
Index is sparse
âș Must fit into memory
âș Determines order of data on disk
âș Using the index is always beneficial
82 / 84
86. Things to remember
Control total number of parts
âș Rate of INSERTs
Merging runs in the background
âș Even when there are no queries!
âș With additional fold logic
Index is sparse
âș Must fit into memory
âș Determines order of data on disk
âș Using the index is always beneficial
Partitions is logical entity
âș Easy manipulation with portions of data
âș Cannot improve SELECTs performance
83 / 84