Presented at the webinar, June 26, 2019
Materialized views are a killer feature of ClickHouse that can speed up queries 20X or more. Our webinar will teach you how to use this potent tool, starting with how to create materialized views and load data. We'll then walk through cookbook examples that solve practical problems: deriving aggregates that outlive base data, answering last-point queries, and using AggregateFunctions, a special ClickHouse feature, to handle problems like counting unique values. There will be time for Q&A at the end. At that point you'll be a wizard of ClickHouse materialized views, able to cast spells of your own.
2. Quick Introduction to Altinity
● Premier provider of software and services for ClickHouse
● Enterprise support for ClickHouse and ecosystem projects
● ClickHouse feature development (geohashing, codecs, tiered storage, log cleansing…)
● Software (Kubernetes, cluster manager, tools & utilities)
● POCs/Training
● Main US/Europe sponsor of ClickHouse community
3. Introduction to ClickHouse
Understands SQL
Runs on bare metal to cloud
Stores data in columns
Parallel and vectorized execution
Scales to many petabytes
Is Open source (Apache 2.0)
Is WAY fast!
(Diagram: a table's values a, b, c, d stored and scanned column-by-column)
5. Computing sums on large datasets
Our favorite dataset!
1.31 billion rows of data
from taxi and limousine
rides in New York City
Question:
What is the average number
of persons per taxi in each
pickup location?
6. ClickHouse can solve it with a simple query
SELECT pickup_location_id,
sum(passenger_count) / pc_cnt AS pc_avg,
count() AS pc_cnt
FROM tripdata GROUP BY pickup_location_id
ORDER BY pc_cnt DESC LIMIT 10
...
10 rows in set. Elapsed: 1.818 sec. Processed 1.31
billion rows, 3.93 GB (721.16 million rows/s., 2.16
GB/s.)
Can we make it faster?
7. We can add a materialized view
CREATE MATERIALIZED VIEW tripdata_smt_mv
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY (pickup_location_id, dropoff_location_id)
POPULATE
AS SELECT
pickup_date,
pickup_location_id,
dropoff_location_id,
sum(passenger_count) AS passenger_count_sum,
count() AS trips_count
FROM tripdata
GROUP BY
pickup_date,
pickup_location_id,
dropoff_location_id
Callouts: Engine · How to derive data · Auto-load after create · How to sum
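A detail worth noting: SummingMergeTree collapses rows that share the same ORDER BY key only during background merges, so one key can temporarily appear in several rows. That is why queries should still re-aggregate at read time. A minimal sketch of how to observe this (table names as above; the location ids are placeholders, and OPTIMIZE ... FINAL forces pending merges but can be expensive on large tables):

```sql
-- Rows for one key may not be collapsed yet right after INSERTs:
SELECT count()
FROM `.inner.tripdata_smt_mv`
WHERE pickup_location_id = 132 AND dropoff_location_id = 138;

-- Force pending merges (for inspection only; expensive on large tables):
OPTIMIZE TABLE `.inner.tripdata_smt_mv` FINAL;
```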
8. Now we can query the materialized view
SELECT pickup_location_id,
sum(passenger_count_sum) / sum(trips_count) AS pc_avg,
sum(trips_count) AS pc_cnt
FROM tripdata_smt_mv GROUP BY pickup_location_id
ORDER BY pc_cnt DESC LIMIT 10
...
10 rows in set. Elapsed: 0.016 sec. Processed 2.04
million rows, 36.66 MB (127.04 million rows/s., 2.29
GB/s.)
Materialized view is 113 times faster
9. What's going on under the covers?
(Diagram)
INSERT → tripdata (MergeTree)                        compressed ~43GB, uncompressed ~104GB
           │ (trigger)
           ▼
         tripdata_smt_mv (materialized view)
           │
           ▼
         .inner.tripdata_smt_mv (SummingMergeTree)   compressed ~13MB, uncompressed ~44MB

SELECT on the materialized view reads from the hidden .inner table
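The hidden storage table can be inspected directly. A sketch using the system.tables catalog (database and view names as above):

```sql
-- The view's rows actually live in a hidden table named .inner.<view name>:
SELECT name, engine
FROM system.tables
WHERE database = currentDatabase()
  AND name LIKE '%tripdata_smt_mv%'
```

This should list two entries: the MaterializedView itself and its .inner SummingMergeTree storage table.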
11. Let's look at aggregates other than sums
SELECT min(fare_amount), avg(fare_amount),
max(fare_amount), sum(fare_amount), count()
FROM tripdata
WHERE (fare_amount > 0) AND (fare_amount < 500.)
...
1 rows in set. Elapsed: 4.640 sec. Processed 1.31
billion rows, 5.24 GB (282.54 million rows/s., 1.13
GB/s.)
Can we make all aggregates faster?
12. Let's create a view for all aggregates
CREATE MATERIALIZED VIEW tripdata_agg_mv
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY (pickup_location_id, dropoff_location_id)
POPULATE AS SELECT
pickup_date, pickup_location_id,
dropoff_location_id,
minState(fare_amount) as fare_amount_min,
avgState(fare_amount) as fare_amount_avg,
maxState(fare_amount) as fare_amount_max,
sumState(fare_amount) as fare_amount_sum,
countState() as fare_amount_count
FROM tripdata
WHERE (fare_amount > 0) AND (fare_amount < 500.)
GROUP BY
pickup_date, pickup_location_id, dropoff_location_id
Callouts: Engine · How to derive data · Auto-load after create · Agg. order · Filter on data
13. Digression: How aggregation works
Source value → Partial aggregate → Merged aggregate
fare_amount → avgState(fare_amount) → avgMerge(fare_amount_avg)
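The State/Merge pairing can be tried directly, without any view: the inner query emits opaque partial-aggregate states, and the outer query merges them into a final value. A sketch against the tripdata table above:

```sql
SELECT avgMerge(partial) AS fare_amount_avg
FROM
(
    -- One partial aggregate state per pickup location
    SELECT avgState(fare_amount) AS partial
    FROM tripdata
    GROUP BY pickup_location_id
)
```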
14. Now let's select from the materialized view
SELECT minMerge(fare_amount_min) AS fare_amount_min,
avgMerge(fare_amount_avg) AS fare_amount_avg,
maxMerge(fare_amount_max) AS fare_amount_max,
sumMerge(fare_amount_sum) AS fare_amount_sum,
countMerge(fare_amount_count) AS fare_amount_count
FROM tripdata_agg_mv
...
1 rows in set. Elapsed: 0.017 sec. Processed 208.86
thousand rows, 23.88 MB (12.54 million rows/s., 1.43
GB/s.)
Materialized view is 274 times faster
16. Last point problems are common in time series
Host
7023
Host
6522
CPU
Utilization
Host
9601
CPU
Utilization
CPU
UtilizationCPU
Utilization
CPU
UtilizationCPU
Utilization
CPU Table
Problem: Show the
current CPU utilization for
each host
CPU
Utilization
CPU
Utilization
17. ClickHouse can solve this using a subquery
SELECT t.hostname, tags_id, 100 - usage_idle usage
FROM (
SELECT tags_id, usage_idle
FROM cpu
WHERE (tags_id, created_at) IN
(SELECT tags_id, max(created_at)
FROM cpu GROUP BY tags_id)
) AS c
INNER JOIN tags AS t ON c.tags_id = t.id
ORDER BY
usage DESC,
t.hostname ASC
LIMIT 10
Callouts: TABLE SCAN! · USE INDEX · OPTIMIZED JOIN COST
18. SQL queries work but are inefficient
OUTPUT:
┌─hostname──┬─tags_id─┬─usage─┐
│ host_1002 │    9003 │   100 │
│ host_1116 │    9117 │   100 │
│ host_1141 │    9142 │   100 │
│ host_1163 │    9164 │   100 │
│ host_1210 │    9211 │   100 │
│ host_1216 │    9217 │   100 │
│ host_1234 │    9235 │   100 │
│ host_1308 │    9309 │   100 │
│ host_1419 │    9420 │   100 │
│ host_1491 │    9492 │   100 │
└───────────┴─────────┴───────┘
Using direct query on table:
10 rows in set. Elapsed: 0.566 sec. Processed 32.87 million rows, 263.13 MB (53.19 million rows/s., 425.81 MB/s.)
Can we bring last point performance closer to real-time?
19. Create an explicit target for aggregate data
CREATE TABLE cpu_last_point_idle_agg (
created_date AggregateFunction(argMax, Date, DateTime),
max_created_at AggregateFunction(max, DateTime),
time AggregateFunction(argMax, String, DateTime),
tags_id UInt32,
usage_idle AggregateFunction(argMax, Float64, DateTime)
)
ENGINE = AggregatingMergeTree()
PARTITION BY tuple()
ORDER BY tags_id
TIP: This is a new way to create materialized views in ClickHouse.
TIP: "tuple()" means don't partition and don't sort data.
20. argMaxState links columns with aggregates
CREATE MATERIALIZED VIEW cpu_last_point_idle_mv
TO cpu_last_point_idle_agg
AS SELECT
argMaxState(created_date, created_at) AS created_date,
maxState(created_at) AS max_created_at,
argMaxState(time, created_at) AS time,
tags_id,
argMaxState(usage_idle, created_at) AS usage_idle
FROM cpu
GROUP BY tags_id
Callouts: Derive data · MV table
21. Understanding how argMaxState works
Source values → Partial aggregates → Merged aggregates
created_at → maxState(created_at) → maxMerge(max_created_at)
usage_idle (same row) → argMaxState(usage_idle, created_at)
  (pick usage_idle from the aggregate with matching created_at)
→ argMaxMerge(usage_idle)
  (pick the usage_idle value from any row with matching created_at)
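For a one-off query, the plain argMax(a, b) function does all of this in a single step, returning the value of a from the row with the maximum b; the State/Merge split only matters when partial results must be stored in a table. A sketch against the cpu table:

```sql
SELECT
    tags_id,
    max(created_at) AS last_seen,
    argMax(usage_idle, created_at) AS last_usage_idle
FROM cpu
GROUP BY tags_id
```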
22. Load materialized view using a SELECT
INSERT INTO cpu_last_point_idle_mv
SELECT
argMaxState(created_date, created_at) AS created_date,
maxState(created_at) AS max_created_at,
argMaxState(time, created_at) AS time,
tags_id,
argMaxState(usage_idle, created_at) AS usage_idle
FROM cpu
GROUP BY tags_id
The POPULATE keyword is not supported for TO views, hence the manual load
23. Let's hide the merge details with a view
CREATE VIEW cpu_last_point_idle_v AS
SELECT
argMaxMerge(created_date) AS created_date,
maxMerge(max_created_at) AS created_at,
argMaxMerge(time) AS time,
tags_id,
argMaxMerge(usage_idle) AS usage_idle
FROM cpu_last_point_idle_mv
GROUP BY tags_id
24. ...Select again from the covering view
SELECT t.hostname, tags_id, 100 - usage_idle usage
FROM cpu_last_point_idle_v AS b
INNER JOIN tags AS t ON b.tags_id = t.id
ORDER BY usage DESC, t.hostname ASC
LIMIT 10
...
10 rows in set. Elapsed: 0.005 sec. Processed 14.00
thousand rows, 391.65 KB (2.97 million rows/s., 82.97
MB/s.)
Last point view is 113 times faster
25. Another view under the covers
cpu
(MergeTree)
cpu_last_point_idle_agg
(AggregatingMergeTree)
cpu_last_point_idle_mv
(Materialized View)
cpu_last_point_idle_v
(View)
INSERT
SELECT
SELECT
(Trigger)
Compressed size: ~3GB
Uncompressed size: ~12GB
Compressed size: ~29K (!!!)
Uncompressed size: ~264K (!!!)
27. It's common to drop old data in large tables
CREATE TABLE traffic (
datetime DateTime,
date Date,
request_id UInt64,
cust_id UInt32,
sku UInt32
) ENGINE = MergeTree
PARTITION BY toYYYYMM(datetime)
ORDER BY (cust_id, date)
TTL datetime + INTERVAL 90 DAY
28. ClickHouse can aggregate unique values
-- Find which hours have most unique visits on
-- SKU #100
SELECT
toStartOfHour(datetime) as hour,
uniq(cust_id) as uniqs
FROM traffic
WHERE sku = 100
GROUP BY hour
ORDER BY uniqs DESC, hour ASC
LIMIT 10
How can we remember deleted visit data?
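One caveat: uniq() is an approximate count (a hash-sketch estimate). When exact numbers are required, uniqExact / uniqExactState can be substituted at the cost of more memory and storage. A sketch comparing the two on the same data:

```sql
SELECT
    toStartOfHour(datetime) AS hour,
    uniq(cust_id) AS approx_uniqs,     -- approximate, bounded-size state
    uniqExact(cust_id) AS exact_uniqs  -- exact, state grows with cardinality
FROM traffic
WHERE sku = 100
GROUP BY hour
```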
29. Create a target table for unique visit data
CREATE TABLE traffic_uniqs_agg (
hour DateTime,
cust_id_uniqs AggregateFunction(uniq, UInt32),
sku UInt32
)
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (sku, hour)
Callouts: Need to partition · Engine · Agg. order · Aggregate for unique values
30. Use uniqState to collect unique customers
CREATE MATERIALIZED VIEW traffic_uniqs_mv
TO traffic_uniqs_agg
AS SELECT
toStartOfHour(datetime) as hour,
uniqState(cust_id) as cust_id_uniqs,
sku
FROM traffic
GROUP BY hour, sku
31. Now we can remember visitors indefinitely!
SELECT
hour,
uniqMerge(cust_id_uniqs) as uniqs
FROM traffic_uniqs_mv
WHERE sku = 100
GROUP BY hour
ORDER BY uniqs DESC, hour ASC
LIMIT 10
32. Check view size regularly!
SELECT
table, count(*) AS columns,
sum(data_compressed_bytes) AS tc,
sum(data_uncompressed_bytes) AS tu
FROM system.columns
WHERE database = currentDatabase()
GROUP BY table
┌─table─────────────┬─columns─┬────────tc─┬─────────tu─┐
│ traffic_uniqs_agg │       3 │  52620883 │   90630442 │
│ traffic           │       5 │ 626152255 │ 1463621588 │
│ traffic_uniqs_mv  │       3 │         0 │          0 │
└───────────────────┴─────────┴───────────┴────────────┘
Uniq hash tables can be quite large
34. Create a view that accepts only future data
CREATE MATERIALIZED VIEW sales_amount_mv
TO sales_amount_agg
AS SELECT
toStartOfHour(datetime) as hour,
sumState(amount) as amount_sum,
sku
FROM sales
WHERE
datetime >= toDateTime('2019-07-12 00:00:00')
GROUP BY hour, sku
Callouts: How to derive data · Accept only future rows from source
35. Load past data with a manual INSERT
INSERT INTO sales_amount_agg
SELECT
toStartOfHour(datetime) as hour,
sumState(amount) as amount_sum,
sku
FROM sales
WHERE
datetime < toDateTime('2019-07-12 00:00:00')
GROUP BY hour, sku
Callouts: Load to view target table · Accept only past rows from source
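Because the view keeps only rows at or after the cutoff and the backfill INSERT keeps only rows before it, each source row should land in the target exactly once. A hypothetical sanity check once the backfill finishes and source INSERTs are quiet (table and column names as above):

```sql
-- Totals from the aggregate target and the source table should agree:
SELECT sumMerge(amount_sum) AS agg_total FROM sales_amount_agg;
SELECT sum(amount)          AS src_total FROM sales;
```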
36. Add an aggregate to MV with inner table
-- Detach view from server.
DETACH TABLE sales_amount_inner_mv
-- Update base table
ALTER TABLE `.inner.sales_amount_inner_mv` ADD COLUMN
`amount_avg` AggregateFunction(avg, Float32)
AFTER amount_sum
Check order carefully!
TIP: Stop source INSERTs to avoid losing data, or load the missing data later.
37. Add column to MV with inner table (cont.)
-- Re-attach view with modifications.
ATTACH MATERIALIZED VIEW sales_amount_inner_mv
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (sku, hour)
AS SELECT
toStartOfHour(datetime) as hour,
sumState(amount) as amount_sum,
avgState(amount) as amount_avg,
sku
FROM sales
GROUP BY hour, sku
Callouts: Empty until new data arrive · Specify table
38. "TO" option views are easier to change
-- Drop view
DROP TABLE sales_amount_mv
-- Update target table
ALTER TABLE sales_amount_agg ADD COLUMN
`amount_avg` AggregateFunction(avg, Float32)
AFTER amount_sum
-- Recreate view
CREATE MATERIALIZED VIEW sales_amount_mv TO sales_amount_agg
AS SELECT toStartOfHour(datetime) as hour,
sumState(amount) as amount_sum,
avgState(amount) as amount_avg,
sku
FROM sales GROUP BY hour, sku
Empty until new data arrive
39. Let's add a dimension to the view
-- Drop view
DROP TABLE sales_amount_mv
-- Update target table
ALTER TABLE sales_amount_agg
ADD COLUMN cust_id UInt32 AFTER sku,
MODIFY ORDER BY (sku, hour, cust_id)
-- Recreate view
CREATE MATERIALIZED VIEW sales_amount_mv TO sales_amount_agg
AS SELECT
toStartOfHour(datetime) as hour,
sumState(amount) as amount_sum,
avgState(amount) as amount_avg,
sku, cust_id
FROM sales GROUP BY hour, sku, cust_id
Empty until new data arrive
41. Key takeaways
● Materialized views are like automatic triggers
  ○ They move data from a source table to a target
● They solve important problems within the database:
  ○ Speed up queries by pre-computing aggregates
  ○ Make "last point" queries extremely fast
  ○ Remember aggregates after source rows are dropped
● AggregateFunctions hold partially aggregated data
  ○ Functions like avgState(X) put data into the view
  ○ Functions like avgMerge(X) get results out
  ○ Functions like argMaxState(X, Y) link values across columns
● Two "styles" of materialized views
  ○ TO an external table, or using the implicit ".inner" table as target
● Loading with filters avoids data loss on busy systems
● Schema migration operations are easier with the "TO" style of MV
42. But wait, there's even more!
● [Many] more use cases for views
  ○ Different sort orders
  ○ Using views as indexes
  ○ Data transformation
● More about table engines and materialized views
  ○ Failures
  ○ Effects of sync behavior when there are multiple views per table
● Common materialized view problems
● More tips on schema migration
43. Thank you!
Special Offer:
Contact us for a
1-hour consultation!
Contacts:
info@altinity.com
Visit us at:
https://www.altinity.com
Free Consultation:
https://blog.altinity.com/offer