Talk for pre-Fosdem MySQL Day on February 1, 2019.
Last year I worked on several tickets where data follow the same pattern: millions of popular products fit into a couple of categories and rest used the rest. We had a hard time to find a solution for retrieving goods fast.
MySQL 8.0 has a feature which resolves such issues: optimizer histograms, storing statistics of an exact number of values in each data bucket.
However in real life histograms help not with all queries, accessing non-uniform data. How you write a query, the number of rows in the table, data distribution: all these may affect the use of histograms.
In this session I show examples, demonstrating how Optimizer uses histograms.
2. • MySQL Support engineer
• Author of
• MySQL Troubleshooting
• JSON UDF functions
• FILTER clause for MySQL
• Speaker
• Percona Live, OOW, Fosdem,
DevConf, HighLoad...
Sveta Smirnova
2
3. •Why do I Care?
•The Use Case
•Even Worse Use Case
•Why the Difference?
•How Histograms Work?
Table of Contents
3
4. The column statistics data dictionary table stores histogram statistics about
column values, for use by the optimizer in constructing query execution plans
MySQL User Reference Manual
Optimizer Statistics aka Histograms
4
6. • Data distribution vary
•
Big difference between number of values
•
Costantly changing
Latest Support Tickets
6
7. • Data distribution vary
• Cardinality is not correct
• Was not updated in time
•
Updates too often
• Calculated wrongly
Latest Support Tickets
6
8. • Data distribution vary
• Cardinality is not correct
• Index maintenance costs a lot
• Hardware resources
•
Slow updates
• Window to run CREATE INDEX
Latest Support Tickets
6
9. • Data distribution vary
• Cardinality is not correct
• Index maintenance costs a lot
• Optimizer does not work as we wish to
Examples in my talk @Percona Live
Latest Support Tickets
6
10. • Topic based on real Support cases
•
Couple of them are still in progress
Disclaimer
7
11. • Topic based on real Support cases
• All examples are 100% fake
•
They created such that
• No customer can be identified
• Everything generated
Table names
Column names
Data
• Use case itself is fictional
Disclaimer
7
12. • Topic based on real Support cases
• All examples are 100% fake
• All examples are simplified
• Only columns, required to show the issue
•
Everything extra removed
• Real tables usually store much more data
Disclaimer
7
13. • Topic based on real Support cases
• All examples are 100% fake
• All examples are simplified
• All disasters happened with version 5.7
Disclaimer
7
16. •
categories
• Less than 20 rows
• goods
• More than 1M rows
• 20 unique cat id values
• Many other fields
Price
Date: added, last updated, etc.
Characteristics
Store
...
Two tables
9
18. • Select from the Small Table
Option 1: Select from the Small Table First
11
19. • Select from the Small Table
• For each cat id select from the large table
Option 1: Select from the Small Table First
11
20. • Select from the Small Table
• For each cat id select from the large table
• Filter result on date added[ and price[...]]
Option 1: Select from the Small Table First
11
21. • Select from the Small Table
• For each cat id select from the large table
• Filter result on date added[ and price[...]]
• Slow with many items in the category
Option 1: Select from the Small Table First
11
22. • Filter rows by date added[ and price[...]]
Option 2: Select from the Large Table First
12
23. • Filter rows by date added[ and price[...]]
• Get cat id values
Option 2: Select from the Large Table First
12
24. • Filter rows by date added[ and price[...]]
• Get cat id values
• Retrieve rows from the small table
Option 2: Select from the Large Table First
12
25. • Filter rows by date added[ and price[...]]
• Get cat id values
• Retrieve rows from the small table
• Slow if number of rows, filtered by
date added, is larger than number of goods in
the selected categories
Option 2: Select from the Large Table First
12
26. •
CREATE INDEX index everything
(cat id, date added[, price[, ...]])
• It resolves the issue
What if use Combined Indexes?
13
27. •
CREATE INDEX index everything
(cat id, date added[, price[, ...]])
• It resolves the issue
• But not in all cases
What if use Combined Indexes?
13
29. • Maintenance cost
•
Slower INSERT/UPDATE/DELETE
• Disk space
• Index not useful for selecting rows
JOIN categories ON (categories.id=goods.cat_id)
JOIN shops ON (shops.id=goods.shop_id)
[ JOIN ... ]
WHERE
date_added between ’2018-07-01’ and ’2018-08-01’
AND
cat_id in (16,11) AND price >= 1000 AND price <=10000 [ AND ... ]
GROUP BY product_type
ORDER BY date_updated DESC
LIMIT 50,100
The Problem
14
30. • Maintenance cost
•
Slower INSERT/UPDATE/DELETE
• Disk space
• Index not useful for selecting rows
• Tables may have wrong cardinality
The Problem
14
31. • EXPLAIN without histograms
mysql> explain select goods.* from goods
-> join categories on (categories.id=goods.cat_id)
-> where cat_id in (20,2,18,4,16,6,14,1,12,11,10,9,8,17)
-> and
-> date_added between ’2000-01-01’ and ’2001-01-01’ -- Large range
-> order by goods.cat_id
-> limit 10G -- We ask for 10 rows only!
Example
15
32. • EXPLAIN without histograms
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: categories -- Small table first
partitions: NULL
type: index
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: NULL
rows: 20
filtered: 70.00
Extra: Using where; Using index;
Using temporary; Using filesort
Example
15
33. • EXPLAIN without histograms
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: goods -- Large table
partitions: NULL
type: ref
possible_keys: cat_id_2
key: cat_id_2
key_len: 5
ref: orig.categories.id
rows: 51827
filtered: 11.11 -- Default value
Extra: Using where
2 rows in set, 1 warning (0.01 sec)
Example
15
34. • Execution time without histograms
mysql> flush status;
Query OK, 0 rows affected (0.00 sec)
mysql> select goods.* from goods
-> join categories on (categories.id=goods.cat_id)
-> where cat_id in (20,2,18,4,16,6,14,1,12,11,10,9,8,17)
-> and
-> date_added between ’2000-01-01’ and ’2001-01-01’
-> order by goods.cat_id
-> limit 10;
ab9f9bb7bc4f357712ec34f067eda364 -
10 rows in set (56.47 sec)
Example
15
35. • Engine statistics without histograms
mysql> show status like ’Handler%’;
+----------------------------+--------+
| Variable_name | Value |
+----------------------------+--------+
...
| Handler_read_next | 964718 |
| Handler_read_prev | 0 |
| Handler_read_rnd | 10 |
| Handler_read_rnd_next | 951671 |
...
| Handler_write | 951670 |
+----------------------------+--------+
18 rows in set (0.01 sec)
Example
15
36. • Now lets add the histogram
mysql> analyze table goods update histogram on date_added;
+------------+-----------+----------+------------------------------+
| Table | Op | Msg_type | Msg_text |
+------------+-----------+----------+------------------------------+
| orig.goods | histogram | status | Histogram statistics created
for column ’date_added’. |
+------------+-----------+----------+------------------------------+
1 row in set (2.01 sec)
Example
15
37. • EXPLAIN with the histogram
mysql> explain select goods.* from goods
-> join categories
-> on (categories.id=goods.cat_id)
-> where cat_id in (20,2,18,4,16,6,14,1,12,11,10,9,8,17)
-> and
-> date_added between ’2000-01-01’ and ’2001-01-01’
-> order by goods.cat_id
-> limit 10G
Example
15
38. • EXPLAIN with the histogram
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: goods -- Large table first
partitions: NULL
type: index
possible_keys: cat_id_2
key: cat_id_2
key_len: 5
ref: NULL
rows: 10 -- Same as we asked
filtered: 98.70 -- True numbers
Extra: Using where
Example
15
39. • EXPLAIN with the histogram
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: categories -- Small table
partitions: NULL
type: eq_ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: orig.goods.cat_id
rows: 1
filtered: 100.00
Extra: Using index
2 rows in set, 1 warning (0.01 sec)
Example
15
40. • Execution time with the histogram
mysql> flush status;
Query OK, 0 rows affected (0.00 sec)
mysql> select goods.* from goods
-> join categories on (categories.id=goods.cat_id)
-> where cat_id in (20,2,18,4,16,6,14,1,12,11,10,9,8,17)
-> and
-> date_added between ’2000-01-01’ and ’2001-01-01’
-> order by goods.cat_id
-> limit 10;
eeb005fae0dd3441c5c380e1d87fee84 -
10 rows in set (0.00 sec) -- 56/0 times faster!
Example
15
47. • Data Distribution: goods characteristics
mysql> select count(*) num_rows, good_id, manufacturer
-> from goods_characteristics group by good_id, manufacturer order by num_rows desc;
+----------+---------+--------------+
| num_rows | good_id | manufacturer |
+----------+---------+--------------+
| 65536 | laptop | Noname |
| 8191 | laptop | Samsung |
| 8191 | laptop | Acer |
| 8189 | laptop | Dell |
| 8189 | laptop | HP |
| 8189 | laptop | Lenovo |
| 8189 | laptop | Toshiba |
| 8189 | laptop | Apple |
| 8189 | laptop | Asus |
| 10 | laptop | Sony |
| 10 | laptop | Casper |
+----------+---------+--------------+
Two Similar Tables
17
48. • Data Distribution: goods shops
mysql> select count(*) num_rows, good_id, location
-> from goods_shops group by good_id, location order by num_rows desc;
+----------+---------+---------------+
| num_rows | good_id | location |
+----------+---------+---------------+
| 8191 | laptop | New York |
| 8191 | laptop | San Francisco |
| 8189 | laptop | Paris |
| 8189 | laptop | Berlin |
| 8189 | laptop | Brussels |
| 8189 | laptop | Tokio |
| 8189 | laptop | Istanbul |
| 8189 | laptop | London |
| 10 | laptop | Moscow |
| 10 | laptop | Kiev |
+----------+---------+---------------+
Two Similar Tables
17
49. • Data Distribution: goods shops
mysql> select count(*) num_rows, good_id, delivery_options
-> from goods_shops group by good_id, delivery_options order by num_rows desc;
+----------+---------+------------------+
| num_rows | good_id | delivery_options |
+----------+---------+------------------+
| 8192 | laptop | DHL |
| 8191 | laptop | PTT |
| 8190 | laptop | Normal Post |
| 8190 | laptop | Tracked |
| 8189 | laptop | Fedex |
| 8189 | laptop | Gruzovichkof |
| 8188 | laptop | Courier |
| 8187 | laptop | No delivery |
| 10 | laptop | Premium |
| 10 | laptop | Urgent |
+----------+---------+------------------+
Two Similar Tables
17
50. Histogram statistics are useful primarily for nonindexed columns. Adding an
index to a column for which histogram statistics are applicable might also help
the optimizer make row estimates. The tradeoffs are:
An index must be updated when table data is modified.
A histogram is created or updated only on demand, so it adds no overhead
when table data is modified. On the other hand, the statistics become progres-
sively more out of date when table modifications occur, until the next time they
are updated.
MySQL User Reference Manual
Optimizer Statistics aka Histograms
18
51. mysql> alter table goods_characteristics stats_sample_pages=5000;
Query OK, 0 rows affected (0.02 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> alter table goods_shops stats_sample_pages=5000;
Query OK, 0 rows affected (0.05 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> analyze table goods_characteristics, goods_shops;
+----------------------------+---------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+----------------------------+---------+----------+----------+
| test.goods_characteristics | analyze | status | OK |
| test.goods_shops | analyze | status | OK |
+----------------------------+---------+----------+----------+
2 rows in set (0.35 sec)
Index Statistics is More than Good
19
52. • The query
mysql> select count(*) from goods_shops join goods_characteristics using (good_id)
-> where size < 12 and manufacturer in (’Lenovo’, ’Dell’, ’Toshiba’, ’Samsung’, ’Acer’)
-> and (location in (’Moscow’, ’Kiev’) or delivery_options in (’Premium’, ’Urgent’));
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
Performance?
20
54. • Table order
mysql> explain select count(*) from goods_shops join goods_characteristics using (good_id)
-> where size < 12 and manufacturer in (’Lenovo’, ’Dell’, ’Toshiba’, ’Samsung’, ’Acer’)
-> and (location in (’Moscow’, ’Kiev’) or delivery_options in (’Premium’, ’Urgent’));
+----+-----------------------+-------+---------+--------+----------+--------------------------+
| id | table | type | key | rows | filtered | Extra |
+----+-----------------------+-------+---------+--------+----------+--------------------------+
| 1 | goods_characteristics | index | good_id | 131072 | 25.00 | Using where; Using index |
| 1 | goods_shops | ref | good_id | 65536 | 36.00 | Using where; Using index |
+----+-----------------------+-------+---------+--------+----------+--------------------------+
2 rows in set, 1 warning (0.00 sec)
Performance?
20
55. • Table order matters
mysql> explain select count(*) from goods_shops straight_join goods_characteristics
-> using (good_id)
-> where size < 12 and manufacturer in (’Lenovo’, ’Dell’, ’Toshiba’, ’Samsung’, ’Acer’)
-> and (location in (’Moscow’, ’Kiev’) or delivery_options in (’Premium’, ’Urgent’));
+----+-----------------------+-------+---------+--------+----------+--------------------------+
| id | table | type | key | rows | filtered | Extra |
+----+-----------------------+-------+---------+--------+----------+--------------------------+
| 1 | goods_shops | index | good_id | 65536 | 36.00 | Using where; Using index |
| 1 | goods_characteristics | ref | good_id | 131072 | 25.00 | Using where; Using index |
+----+-----------------------+-------+---------+--------+----------+--------------------------+
2 rows in set, 1 warning (0.00 sec)
Performance?
20
56. • Table order matters
mysql> select count(*) from goods_shops straight_join goods_characteristics using (good_id)
-> where size < 12 and manufacturer in (’Lenovo’, ’Dell’, ’Toshiba’, ’Samsung’, ’Acer’)
-> and (location in (’Moscow’, ’Kiev’) or delivery_options in (’Premium’, ’Urgent’));
+----------+
| count(*) |
+----------+
| 816640 |
+----------+
1 row in set (2.11 sec)
mysql> show status like ’Handler_read_next’;
+-------------------+-----------+
| Variable_name | Value |
+-------------------+-----------+
| Handler_read_next | 5,308,416 |
+-------------------+-----------+
1 row in set (0.00 sec)
Performance?
20
57. mysql> analyze table goods_shops update histogram on location, delivery_options;
+-------------+-----------+----------+-----------------------------------------------------+
| Table | Op | Msg_type | Msg_text |
+-------------+-----------+----------+-----------------------------------------------------+
| goods_shops | histogram | status | Histogram statistics created... ’delivery_options’. |
| goods_shops | histogram | status | Histogram statistics created for column ’location’. |
+-------------+-----------+----------+-----------------------------------------------------+
2 rows in set (0.18 sec)
mysql> analyze table goods_characteristics update histogram on size, manufacturer ;
+-----------------------+-----------+----------+-------------------------------------------------+
| Table | Op | Msg_type | Msg_text |
+-----------------------+-----------+----------+-------------------------------------------------+
| goods_characteristics | histogram | status | Histogram statistics created... ’manufacturer’. |
| goods_characteristics | histogram | status | Histogram statistics created for column ’size’. |
+-----------------------+-----------+----------+-------------------------------------------------+
2 rows in set (0.23 sec)
Histograms to Rescue
21
58. • The query
mysql> select count(*) from goods_shops join goods_characteristics using (good_id)
-> where size < 12 and manufacturer in (’Lenovo’, ’Dell’, ’Toshiba’, ’Samsung’, ’Acer’)
-> and (location in (’Moscow’, ’Kiev’) or delivery_options in (’Premium’, ’Urgent’));
+----------+
| count(*) |
+----------+
| 816640 |
+----------+
1 row in set (2.16 sec)
mysql> show status like ’Handler_read_next’;
+-------------------+-----------+
| Variable_name | Value |
+-------------------+-----------+
| Handler_read_next | 5,308,418 |
+-------------------+-----------+
1 row in set (0.00 sec)
Histograms to Rescue
21
59. • Filtering effect
mysql> explain select count(*) from goods_shops join goods_characteristics using (good_id) where s
+----+-----------------------+-------+---------+--------+----------+--------------------------+
| id | table | type | key | rows | filtered | Extra |
+----+-----------------------+-------+---------+--------+----------+--------------------------+
| 1 | goods_shops | index | good_id | 65536 | 0.06 | Using where; Using index |
| 1 | goods_characteristics | ref | good_id | 131072 | 15.63 | Using where; Using index |
+----+-----------------------+-------+---------+--------+----------+--------------------------+
2 rows in set, 1 warning (0.00 sec)
Histograms to Rescue
21
70. ↓ sql/sql planner.cc
↓ calculate condition filter
↓ Item func *::get filtering effect
• get histogram selectivity
• Seen as a percent of filtered rows in EXPLAIN
Low Level
28
71. • Example data
mysql> create table example(f1 int) engine=innodb;
mysql> insert into example values(1),(1),(1),(2),(3);
mysql> select f1, count(f1) from example group by f1;
+------+-----------+
| f1 | count(f1) |
+------+-----------+
| 1 | 3 |
| 2 | 1 |
| 3 | 1 |
+------+-----------+
3 rows in set (0.00 sec)
Filtered Rows
29
72. • Without a histogram
mysql> explain select * from example where f1 > 0G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: example
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 5
filtered: 33.33
Extra: Using where
1 row in set, 1 warning (0.00 sec)
Filtered Rows
29
73. • Without a histogram
mysql> explain select * from example where f1 > 1G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: example
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 5
filtered: 33.33
Extra: Using where
1 row in set, 1 warning (0.00 sec)
Filtered Rows
29
74. • Without a histogram
mysql> explain select * from example where f1 > 2G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: example
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 5
filtered: 33.33
Extra: Using where
1 row in set, 1 warning (0.00 sec)
Filtered Rows
29
75. • Without a histogram
mysql> explain select * from example where f1 > 3G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: example
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 5
filtered: 33.33
Extra: Using where
1 row in set, 1 warning (0.00 sec)
Filtered Rows
29
76. • With the histogram
mysql> analyze table example update histogram on f1 with 3 buckets;
+-----------------+-----------+----------+------------------------------+
| Table | Op | Msg_type | Msg_text |
+-----------------+-----------+----------+------------------------------+
| hist_ex.example | histogram | status | Histogram statistics created
for column ’f1’. |
+-----------------+-----------+----------+------------------------------+
1 row in set (0.03 sec)
Filtered Rows
29
77. • With the histogram
mysql> select * from information_schema.column_statistics
-> where table_name=’example’G
*************************** 1. row ***************************
SCHEMA_NAME: hist_ex
TABLE_NAME: example
COLUMN_NAME: f1
HISTOGRAM:
"buckets": [[1, 0.6], [2, 0.8], [3, 1.0]],
"data-type": "int", "null-values": 0.0, "collation-id": 8,
"last-updated": "2018-11-07 09:07:19.791470",
"sampling-rate": 1.0, "histogram-type": "singleton",
"number-of-buckets-specified": 3
1 row in set (0.00 sec)
Filtered Rows
29
78. • With the histogram
mysql> explain select * from example where f1 > 0G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: example
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 5
filtered: 100.00 -- all rows
Extra: Using where
1 row in set, 1 warning (0.00 sec)
Filtered Rows
29
79. • With the histogram
mysql> explain select * from example where f1 > 1G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: example
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 5
filtered: 40.00 -- 2 rows
Extra: Using where
1 row in set, 1 warning (0.00 sec)
Filtered Rows
29
80. • With the histogram
mysql> explain select * from example where f1 > 2G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: example
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 5
filtered: 20.00 -- one row
Extra: Using where
1 row in set, 1 warning (0.00 sec)
Filtered Rows
29
81. • With the histogram
mysql> explain select * from example where f1 > 3G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: example
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 5
filtered: 20.00 - one row
Extra: Using where
1 row in set, 1 warning (0.00 sec)
Filtered Rows
29
83. •
CREATE INDEX
• Metadata lock
•
Can be blocked by any query
• UPDATE HISTOGRAM
• Backup lock
• Can be locked only by a backup
•
Can be created any time without fear
Locking
30
84. • Helps if query plan can be changed
• Not a replacement for the index:
•
GROUP BY
• ORDER BY
• Query on a single table ∗
Outcome
31
85. • Data distribution is uniform
• Range optimization can be used
• Full table scan is fast
When Histogram are not Helpful?
32
86. • Index statistics collected by the engine
• Optimizer calculates Cardinality each time
when accesses statistics
•
Indexes not always improve performance
• Histograms can help
Still new feature
• Histograms do not replace other optimizations!
Conclusion
33
87. MySQL User Reference Manual
Blog by Erik Froseth
Blog by Frederic Descamps
Talk by Oystein Grovlen @Fosdem
Talk by Sergei Petrunia @PerconaLive
WL #8707
More information
34