Presented at the webinar, June 26, 2019
Materialized views are a killer feature of ClickHouse that can speed up queries 20X or more. Our webinar will teach you how to use this potent tool, starting with how to create materialized views and load data. We'll then walk through cookbook examples that solve practical problems: deriving aggregates that outlive base data, answering last-point queries, and using AggregateFunctions, a special ClickHouse feature, to handle problems like counting unique values. There will be time for Q&A at the end. At that point you'll be a wizard of ClickHouse materialized views, able to cast spells of your own.
2. Quick Introduction to Altinity
● Premier provider of software and services for ClickHouse
● Enterprise support for ClickHouse and ecosystem projects
● ClickHouse feature development (geohashing, codecs, tiered storage, log cleansing…)
● Software (Kubernetes, cluster manager, tools & utilities)
● POCs/Training
● Main US/Europe sponsor of ClickHouse community
3. Introduction to ClickHouse
Understands SQL
Runs on bare metal to cloud
Stores data in columns
Parallel and vectorized execution
Scales to many petabytes
Is Open source (Apache 2.0)
Is WAY fast!
(Diagram: a table's values a, b, c, d stored and scanned column-by-column)
5. Computing sums on large datasets
Our favorite dataset!
1.31 billion rows of data
from taxi and limousine
rides in New York City
Question:
What is the average number
of persons per taxi in each
pickup location?
6. ClickHouse can solve it with a simple query
SELECT pickup_location_id,
sum(passenger_count) / pc_cnt AS pc_avg,
count() AS pc_cnt
FROM tripdata GROUP BY pickup_location_id
ORDER BY pc_cnt DESC LIMIT 10
...
10 rows in set. Elapsed: 1.818 sec. Processed 1.31
billion rows, 3.93 GB (721.16 million rows/s., 2.16
GB/s.)
Can we make it faster?
7. We can add a materialized view
CREATE MATERIALIZED VIEW tripdata_smt_mv
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY (pickup_location_id, dropoff_location_id)
POPULATE
AS SELECT
pickup_date,
pickup_location_id,
dropoff_location_id,
sum(passenger_count) AS passenger_count_sum,
count() AS trips_count
FROM tripdata
GROUP BY
pickup_date,
pickup_location_id,
dropoff_location_id
Callouts: Engine · How to derive data · Auto-load after create · How to sum
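A detail worth noting: SummingMergeTree collapses rows that share the same ORDER BY key only during background merges, so one key can temporarily appear in several rows. That is why queries should still re-aggregate at read time. A minimal sketch of how to observe this (table names as above; the location ids are placeholders, and OPTIMIZE ... FINAL forces pending merges but can be expensive on large tables):

```sql
-- Rows for one key may not be collapsed yet right after INSERTs:
SELECT count()
FROM `.inner.tripdata_smt_mv`
WHERE pickup_location_id = 132 AND dropoff_location_id = 138;

-- Force pending merges (for inspection only; expensive on large tables):
OPTIMIZE TABLE `.inner.tripdata_smt_mv` FINAL;
```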
8. Now we can query the materialized view
SELECT pickup_location_id,
sum(passenger_count_sum) / sum(trips_count) AS pc_avg,
sum(trips_count) AS pc_cnt
FROM tripdata_smt_mv GROUP BY pickup_location_id
ORDER BY pc_cnt DESC LIMIT 10
...
10 rows in set. Elapsed: 0.016 sec. Processed 2.04
million rows, 36.66 MB (127.04 million rows/s., 2.29
GB/s.)
Materialized view is 113 times faster
9. What's going on under the covers?
(Diagram)
INSERT → tripdata (MergeTree)                        compressed ~43GB, uncompressed ~104GB
           │ (trigger)
           ▼
         tripdata_smt_mv (materialized view)
           │
           ▼
         .inner.tripdata_smt_mv (SummingMergeTree)   compressed ~13MB, uncompressed ~44MB

SELECT on the materialized view reads from the hidden .inner table
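The hidden storage table can be inspected directly. A sketch using the system.tables catalog (database and view names as above):

```sql
-- The view's rows actually live in a hidden table named .inner.<view name>:
SELECT name, engine
FROM system.tables
WHERE database = currentDatabase()
  AND name LIKE '%tripdata_smt_mv%'
```

This should list two entries: the MaterializedView itself and its .inner SummingMergeTree storage table.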
11. Let's look at aggregates other than sums
SELECT min(fare_amount), avg(fare_amount),
max(fare_amount), sum(fare_amount), count()
FROM tripdata
WHERE (fare_amount > 0) AND (fare_amount < 500.)
...
1 rows in set. Elapsed: 4.640 sec. Processed 1.31
billion rows, 5.24 GB (282.54 million rows/s., 1.13
GB/s.)
Can we make all aggregates faster?
12. Let's create a view for all aggregates
CREATE MATERIALIZED VIEW tripdata_agg_mv
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY (pickup_location_id, dropoff_location_id)
POPULATE AS SELECT
pickup_date, pickup_location_id,
dropoff_location_id,
minState(fare_amount) as fare_amount_min,
avgState(fare_amount) as fare_amount_avg,
maxState(fare_amount) as fare_amount_max,
sumState(fare_amount) as fare_amount_sum,
countState() as fare_amount_count
FROM tripdata
WHERE (fare_amount > 0) AND (fare_amount < 500.)
GROUP BY
pickup_date, pickup_location_id, dropoff_location_id
Callouts: Engine · How to derive data · Auto-load after create · Agg. order · Filter on data
13. Digression: How aggregation works
Source value → Partial aggregate → Merged aggregate
fare_amount → avgState(fare_amount) → avgMerge(fare_amount_avg)
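The State/Merge pairing can be tried directly, without any view: the inner query emits opaque partial-aggregate states, and the outer query merges them into a final value. A sketch against the tripdata table above:

```sql
SELECT avgMerge(partial) AS fare_amount_avg
FROM
(
    -- One partial aggregate state per pickup location
    SELECT avgState(fare_amount) AS partial
    FROM tripdata
    GROUP BY pickup_location_id
)
```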
14. Now let's select from the materialized view
SELECT minMerge(fare_amount_min) AS fare_amount_min,
avgMerge(fare_amount_avg) AS fare_amount_avg,
maxMerge(fare_amount_max) AS fare_amount_max,
sumMerge(fare_amount_sum) AS fare_amount_sum,
countMerge(fare_amount_count) AS fare_amount_count
FROM tripdata_agg_mv
...
1 rows in set. Elapsed: 0.017 sec. Processed 208.86
thousand rows, 23.88 MB (12.54 million rows/s., 1.43
GB/s.)
Materialized view is 274 times faster
16. Last point problems are common in time series
Host
7023
Host
6522
CPU
Utilization
Host
9601
CPU
Utilization
CPU
UtilizationCPU
Utilization
CPU
UtilizationCPU
Utilization
CPU Table
Problem: Show the
current CPU utilization for
each host
CPU
Utilization
CPU
Utilization
17. ClickHouse can solve this using a subquery
SELECT t.hostname, tags_id, 100 - usage_idle usage
FROM (
SELECT tags_id, usage_idle
FROM cpu
WHERE (tags_id, created_at) IN
(SELECT tags_id, max(created_at)
FROM cpu GROUP BY tags_id)
) AS c
INNER JOIN tags AS t ON c.tags_id = t.id
ORDER BY
usage DESC,
t.hostname ASC
LIMIT 10
Callouts: TABLE SCAN! · USE INDEX · OPTIMIZED JOIN COST
18. SQL queries work but are inefficient
OUTPUT:
┌─hostname──┬─tags_id─┬─usage─┐
│ host_1002 │    9003 │   100 │
│ host_1116 │    9117 │   100 │
│ host_1141 │    9142 │   100 │
│ host_1163 │    9164 │   100 │
│ host_1210 │    9211 │   100 │
│ host_1216 │    9217 │   100 │
│ host_1234 │    9235 │   100 │
│ host_1308 │    9309 │   100 │
│ host_1419 │    9420 │   100 │
│ host_1491 │    9492 │   100 │
└───────────┴─────────┴───────┘
Using direct query on table:
10 rows in set. Elapsed: 0.566 sec. Processed 32.87 million rows, 263.13 MB (53.19 million rows/s., 425.81 MB/s.)
Can we bring last point performance closer to real-time?
19. Create an explicit target for aggregate data
CREATE TABLE cpu_last_point_idle_agg (
created_date AggregateFunction(argMax, Date, DateTime),
max_created_at AggregateFunction(max, DateTime),
time AggregateFunction(argMax, String, DateTime),
tags_id UInt32,
usage_idle AggregateFunction(argMax, Float64, DateTime)
)
ENGINE = AggregatingMergeTree()
PARTITION BY tuple()
ORDER BY tags_id
TIP: This is a new way to create materialized views in ClickHouse.
TIP: "tuple()" means don't partition and don't sort data.
20. argMaxState links columns with aggregates
CREATE MATERIALIZED VIEW cpu_last_point_idle_mv
TO cpu_last_point_idle_agg
AS SELECT
argMaxState(created_date, created_at) AS created_date,
maxState(created_at) AS max_created_at,
argMaxState(time, created_at) AS time,
tags_id,
argMaxState(usage_idle, created_at) AS usage_idle
FROM cpu
GROUP BY tags_id
Callouts: Derive data · MV table
21. Understanding how argMaxState works
Source values → Partial aggregates → Merged aggregates
created_at → maxState(created_at) → maxMerge(max_created_at)
usage_idle (same row) → argMaxState(usage_idle, created_at)
  (pick usage_idle from the aggregate with matching created_at)
→ argMaxMerge(usage_idle)
  (pick the usage_idle value from any row with matching created_at)
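For a one-off query, the plain argMax(a, b) function does all of this in a single step, returning the value of a from the row with the maximum b; the State/Merge split only matters when partial results must be stored in a table. A sketch against the cpu table:

```sql
SELECT
    tags_id,
    max(created_at) AS last_seen,
    argMax(usage_idle, created_at) AS last_usage_idle
FROM cpu
GROUP BY tags_id
```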
22. Load materialized view using a SELECT
INSERT INTO cpu_last_point_idle_mv
SELECT
argMaxState(created_date, created_at) AS created_date,
maxState(created_at) AS max_created_at,
argMaxState(time, created_at) AS time,
tags_id,
argMaxState(usage_idle, created_at) AS usage_idle
FROM cpu
GROUP BY tags_id
The POPULATE keyword is not supported for TO views, hence the manual load
23. Let's hide the merge details with a view
CREATE VIEW cpu_last_point_idle_v AS
SELECT
argMaxMerge(created_date) AS created_date,
maxMerge(max_created_at) AS created_at,
argMaxMerge(time) AS time,
tags_id,
argMaxMerge(usage_idle) AS usage_idle
FROM cpu_last_point_idle_mv
GROUP BY tags_id
24. ...Select again from the covering view
SELECT t.hostname, tags_id, 100 - usage_idle usage
FROM cpu_last_point_idle_v AS b
INNER JOIN tags AS t ON b.tags_id = t.id
ORDER BY usage DESC, t.hostname ASC
LIMIT 10
...
10 rows in set. Elapsed: 0.005 sec. Processed 14.00
thousand rows, 391.65 KB (2.97 million rows/s., 82.97
MB/s.)
Last point view is 113 times faster
25. Another view under the covers
cpu
(MergeTree)
cpu_last_point_idle_agg
(AggregatingMergeTree)
cpu_last_point_idle_mv
(Materialized View)
cpu_last_point_idle_v
(View)
INSERT
SELECT
SELECT
(Trigger)
Compressed size: ~3GB
Uncompressed size: ~12GB
Compressed size: ~29K (!!!)
Uncompressed size: ~264K (!!!)
27. It's common to drop old data in large tables
CREATE TABLE traffic (
datetime DateTime,
date Date,
request_id UInt64,
cust_id UInt32,
sku UInt32
) ENGINE = MergeTree
PARTITION BY toYYYYMM(datetime)
ORDER BY (cust_id, date)
TTL datetime + INTERVAL 90 DAY
28. ClickHouse can aggregate unique values
-- Find which hours have most unique visits on
-- SKU #100
SELECT
toStartOfHour(datetime) as hour,
uniq(cust_id) as uniqs
FROM traffic
WHERE sku = 100
GROUP BY hour
ORDER BY uniqs DESC, hour ASC
LIMIT 10
How can we remember deleted visit data?
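One caveat: uniq() is an approximate count (a hash-sketch estimate). When exact numbers are required, uniqExact / uniqExactState can be substituted at the cost of more memory and storage. A sketch comparing the two on the same data:

```sql
SELECT
    toStartOfHour(datetime) AS hour,
    uniq(cust_id) AS approx_uniqs,     -- approximate, bounded-size state
    uniqExact(cust_id) AS exact_uniqs  -- exact, state grows with cardinality
FROM traffic
WHERE sku = 100
GROUP BY hour
```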
29. Create a target table for unique visit data
CREATE TABLE traffic_uniqs_agg (
hour DateTime,
cust_id_uniqs AggregateFunction(uniq, UInt32),
sku UInt32
)
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (sku, hour)
Callouts: Need to partition · Engine · Agg. order · Aggregate for unique values
30. Use uniqState to collect unique customers
CREATE MATERIALIZED VIEW traffic_uniqs_mv
TO traffic_uniqs_agg
AS SELECT
toStartOfHour(datetime) as hour,
uniqState(cust_id) as cust_id_uniqs,
sku
FROM traffic
GROUP BY hour, sku
31. Now we can remember visitors indefinitely!
SELECT
hour,
uniqMerge(cust_id_uniqs) as uniqs
FROM traffic_uniqs_mv
WHERE sku = 100
GROUP BY hour
ORDER BY uniqs DESC, hour ASC
LIMIT 10
32. Check view size regularly!
SELECT
table, count(*) AS columns,
sum(data_compressed_bytes) AS tc,
sum(data_uncompressed_bytes) AS tu
FROM system.columns
WHERE database = currentDatabase()
GROUP BY table
┌─table─────────────┬─columns─┬────────tc─┬─────────tu─┐
│ traffic_uniqs_agg │       3 │  52620883 │   90630442 │
│ traffic           │       5 │ 626152255 │ 1463621588 │
│ traffic_uniqs_mv  │       3 │         0 │          0 │
└───────────────────┴─────────┴───────────┴────────────┘
Uniq hash tables can be quite large
34. Create a view that accepts only future data
CREATE MATERIALIZED VIEW sales_amount_mv
TO sales_amount_agg
AS SELECT
toStartOfHour(datetime) as hour,
sumState(amount) as amount_sum,
sku
FROM sales
WHERE
datetime >= toDateTime('2019-07-12 00:00:00')
GROUP BY hour, sku
Callouts: How to derive data · Accept only future rows from source
35. Load past data with a manual INSERT
INSERT INTO sales_amount_agg
SELECT
toStartOfHour(datetime) as hour,
sumState(amount) as amount_sum,
sku
FROM sales
WHERE
datetime < toDateTime('2019-07-12 00:00:00')
GROUP BY hour, sku
Callouts: Load to view target table · Accept only past rows from source
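Because the view keeps only rows at or after the cutoff and the backfill INSERT keeps only rows before it, each source row should land in the target exactly once. A hypothetical sanity check once the backfill finishes and source INSERTs are quiet (table and column names as above):

```sql
-- Totals from the aggregate target and the source table should agree:
SELECT sumMerge(amount_sum) AS agg_total FROM sales_amount_agg;
SELECT sum(amount)          AS src_total FROM sales;
```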
36. Add an aggregate to MV with inner table
-- Detach view from server.
DETACH TABLE sales_amount_inner_mv
-- Update base table
ALTER TABLE `.inner.sales_amount_inner_mv` ADD COLUMN
`amount_avg` AggregateFunction(avg, Float32)
AFTER amount_sum
Check order carefully!
TIP: Stop source INSERTs to avoid losing data, or load the missing data later.
37. Add column to MV with inner table (cont.)
-- Re-attach view with modifications.
ATTACH MATERIALIZED VIEW sales_amount_inner_mv
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (sku, hour)
AS SELECT
toStartOfHour(datetime) as hour,
sumState(amount) as amount_sum,
avgState(amount) as amount_avg,
sku
FROM sales
GROUP BY hour, sku
Callouts: Empty until new data arrive · Specify table
38. "TO" option views are easier to change
-- Drop view
DROP TABLE sales_amount_mv
-- Update target table
ALTER TABLE sales_amount_agg ADD COLUMN
`amount_avg` AggregateFunction(avg, Float32)
AFTER amount_sum
-- Recreate view
CREATE MATERIALIZED VIEW sales_amount_mv TO sales_amount_agg
AS SELECT toStartOfHour(datetime) as hour,
sumState(amount) as amount_sum,
avgState(amount) as amount_avg,
sku
FROM sales GROUP BY hour, sku
Empty until new data arrive
39. Let's add a dimension to the view
-- Drop view
DROP TABLE sales_amount_mv
-- Update target table
ALTER TABLE sales_amount_agg
ADD COLUMN cust_id UInt32 AFTER sku,
MODIFY ORDER BY (sku, hour, cust_id)
-- Recreate view
CREATE MATERIALIZED VIEW sales_amount_mv TO sales_amount_agg
AS SELECT
toStartOfHour(datetime) as hour,
sumState(amount) as amount_sum,
avgState(amount) as amount_avg,
sku, cust_id
FROM sales GROUP BY hour, sku, cust_id
Empty until new data arrive
41. Key takeaways
● Materialized views are like automatic triggers
  ○ They move data from a source table to a target
● They solve important problems within the database:
  ○ Speed up queries by pre-computing aggregates
  ○ Make "last point" queries extremely fast
  ○ Remember aggregates after source rows are dropped
● AggregateFunctions hold partially aggregated data
  ○ Functions like avgState(X) put data into the view
  ○ Functions like avgMerge(X) get results out
  ○ Functions like argMaxState(X, Y) link values across columns
● Two "styles" of materialized views
  ○ TO an external table, or using the implicit ".inner" table as target
● Loading with filters avoids data loss on busy systems
● Schema migration operations are easier with the "TO" style of MV
42. But wait, there's even more!
● [Many] more use cases for views
  ○ Different sort orders
  ○ Using views as indexes
  ○ Data transformation
● More about table engines and materialized views
  ○ Failures
  ○ Effects of sync behavior when there are multiple views per table
● Common materialized view problems
● More tips on schema migration
43. Thank you!
Special Offer:
Contact us for a
1-hour consultation!
Contacts:
info@altinity.com
Visit us at:
https://www.altinity.com
Free Consultation:
https://blog.altinity.com/offer