Optimizer Histograms: When they Help and When Do Not?

February 1, 2019
Sveta Smirnova

• MySQL Support engineer
• Author of
• MySQL Troubleshooting
• JSON UDF functions
• FILTER clause for MySQL
• Speaker
• Percona Live, OOW, Fosdem, DevConf, HighLoad...
Sveta Smirnova

• Why Do I Care?
• The Use Case
• Even Worse Use Case
• Why the Difference?
• How Do Histograms Work?
Table of Contents

The column_statistics data dictionary table stores histogram statistics about column values, for use by the optimizer in constructing query execution plans
MySQL User Reference Manual
Optimizer Statistics aka Histograms

• Data distribution varies
• Big difference between number of values
• Constantly changing
Latest Support Tickets

• Cardinality is not correct
• Was not updated in time
• Updates too often
• Calculated wrongly

• Index maintenance costs a lot
• Hardware resources
• Slow updates
• Window to run CREATE INDEX

• Index maintenance costs a lot
• Optimizer does not work as we wish
Examples in my talk @Percona Live

• Topic is based on real Support cases
• A couple of them are still in progress
Disclaimer

• All examples are 100% fake
• They are created such that
• No customer can be identified
• Everything is generated: table names, column names, data
• Use case itself is fictional

• All examples are simplified
• Only the columns required to show the issue
• Everything extra removed
• Real tables usually store much more data

• All disasters happened with version 5.7

• categories
• Less than 20 rows
• goods
• More than 1M rows
• 20 unique cat_id values
• Many other fields: price; dates (added, last updated, etc.); characteristics; store; ...
Two tables

select *
from goods
join categories on (categories.id=goods.cat_id)
where date_added between '2018-07-01' and '2018-08-01'
and cat_id in (16,11)
and price >= 1000 and price <= 10000 [ and ... ]
[ GROUP BY ... [ ORDER BY ... [ LIMIT ... ]]]
;
JOIN

• Select from the small table
• For each cat_id select from the large table
• Filter result on date_added[ and price[...]]
• Slow with many items in the category
Option 1: Select from the Small Table First

• Filter rows by date_added[ and price[...]]
• Get cat_id values
• Retrieve rows from the small table
• Slow if the number of rows filtered by date_added is larger than the number of goods in the selected categories
Option 2: Select from the Large Table First
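
The trade-off between the two options can be sketched as a toy row-count comparison. This is only an illustration in Python: the function names and all numbers are hypothetical, and the real optimizer uses a much richer cost model than raw row counts.

```python
def rows_small_table_first(goods_per_category, selected_categories):
    """Option 1: for each selected cat_id, read all of its rows in the large table."""
    return sum(goods_per_category[cat_id] for cat_id in selected_categories)


def rows_large_table_first(rows_in_date_range):
    """Option 2: read the rows matching the date range, then look up categories."""
    return rows_in_date_range


def cheaper_option(goods_per_category, selected_categories, rows_in_date_range):
    """Return 1 or 2, whichever option reads fewer rows from the large table."""
    option1 = rows_small_table_first(goods_per_category, selected_categories)
    option2 = rows_large_table_first(rows_in_date_range)
    return 1 if option1 <= option2 else 2


# Hypothetical distribution: two selected categories with 50,000 goods each.
goods_per_category = {11: 50_000, 16: 50_000}

# A narrow date range favours reading the large table first.
narrow_range = cheaper_option(goods_per_category, [16, 11], rows_in_date_range=20_000)
# A wide date range favours starting from the small table.
wide_range = cheaper_option(goods_per_category, [16, 11], rows_in_date_range=500_000)
print(narrow_range, wide_range)
```

Which option wins depends entirely on the data distribution, which is exactly the information the optimizer lacks when its statistics are wrong.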

• CREATE INDEX index_everything ON goods (cat_id, date_added[, price[, ...]])
• It resolves the issue
• But not in all cases
What If We Use Combined Indexes?

• Maintenance cost
• Slower INSERT/UPDATE/DELETE
• Disk space
The Problem

• Index not useful for selecting rows
JOIN categories ON (categories.id=goods.cat_id)
JOIN shops ON (shops.id=goods.shop_id)
[ JOIN ... ]
WHERE
date_added between '2018-07-01' and '2018-08-01'
AND
cat_id in (16,11) AND price >= 1000 AND price <=10000 [ AND ... ]
GROUP BY product_type
ORDER BY date_updated DESC
LIMIT 50,100

• Tables may have wrong cardinality

• EXPLAIN without histograms
mysql> explain select goods.* from goods
-> join categories on (categories.id=goods.cat_id)
-> where cat_id in (20,2,18,4,16,6,14,1,12,11,10,9,8,17)
-> and
-> date_added between '2000-01-01' and '2001-01-01' -- Large range
-> order by goods.cat_id
-> limit 10\G -- We ask for 10 rows only!
Example

*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: categories -- Small table first
partitions: NULL
type: index
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: NULL
rows: 20
filtered: 70.00
Extra: Using where; Using index; Using temporary; Using filesort

*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: goods -- Large table
partitions: NULL
type: ref
possible_keys: cat_id_2
key: cat_id_2
key_len: 5
ref: orig.categories.id
rows: 51827
filtered: 11.11 -- Default value
Extra: Using where
2 rows in set, 1 warning (0.01 sec)

• Execution time without histograms
mysql> flush status;
Query OK, 0 rows affected (0.00 sec)
mysql> select goods.* from goods
-> where cat_id in (20,2,18,4,16,6,14,1,12,11,10,9,8,17)
-> and
-> date_added between '2000-01-01' and '2001-01-01'
-> limit 10;
ab9f9bb7bc4f357712ec34f067eda364 -
10 rows in set (56.47 sec)

• Engine statistics without histograms
mysql> show status like 'Handler%';
+----------------------------+--------+
| Variable_name | Value |
+----------------------------+--------+
...
| Handler_read_next | 964718 |
| Handler_read_prev | 0 |
| Handler_read_rnd | 10 |
| Handler_read_rnd_next | 951671 |
...
| Handler_write | 951670 |
+----------------------------+--------+

• EXPLAIN with the histogram
mysql> explain select goods.* from goods
-> join categories
-> on (categories.id=goods.cat_id)
-> where cat_id in (20,2,18,4,16,6,14,1,12,11,10,9,8,17)
-> and
-> date_added between '2000-01-01' and '2001-01-01'
-> limit 10\G

*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: goods -- Large table first
partitions: NULL
type: index
possible_keys: cat_id_2
key: cat_id_2
key_len: 5
ref: NULL
rows: 10 -- Same as we asked
filtered: 98.70 -- True numbers
Extra: Using where

*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: categories -- Small table
partitions: NULL
type: eq_ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: orig.goods.cat_id
rows: 1
filtered: 100.00
Extra: Using index

• Execution time with the histogram
mysql> flush status;
mysql> select goods.* from goods
-> where cat_id in (20,2,18,4,16,6,14,1,12,11,10,9,8,17)
-> and
-> date_added between '2000-01-01' and '2001-01-01'
-> limit 10;
eeb005fae0dd3441c5c380e1d87fee84 -
10 rows in set (0.00 sec) -- 56/0 times faster!

• Engine statistics with the histogram
+----------------------------+-------++----------------------------+-------+
| Variable_name | Value || Variable_name | Value |
+----------------------------+-------++----------------------------+-------+
| Handler_commit | 1 || Handler_read_prev | 0 |
| Handler_delete | 0 || Handler_read_rnd | 0 |
| Handler_discover | 0 || Handler_read_rnd_next | 0 |
| Handler_external_lock | 4 || Handler_rollback | 0 |
| Handler_mrr_init | 0 || Handler_savepoint | 0 |
| Handler_prepare | 0 || Handler_savepoint_rollback | 0 |
| Handler_read_first | 1 || Handler_update | 0 |
| Handler_read_key | 3 || Handler_write | 0 |
| Handler_read_last | 0 |+----------------------------+-------+
| Handler_read_next | 9 |18 rows in set (0.00 sec)

• goods_characteristics
CREATE TABLE `goods_characteristics` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`good_id` varchar(30) DEFAULT NULL,
`size` int(11) DEFAULT NULL,
`manufacturer` varchar(30) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `good_id` (`good_id`,`size`,`manufacturer`),
KEY `size` (`size`,`manufacturer`)
) ENGINE=InnoDB AUTO_INCREMENT=196606 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
Two Similar Tables

• goods_shops
CREATE TABLE `goods_shops` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`good_id` varchar(30) DEFAULT NULL,
`location` varchar(30) DEFAULT NULL,
`delivery_options` varchar(30) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `good_id` (`good_id`,`location`,`delivery_options`),
KEY `location` (`location`,`delivery_options`)
) ENGINE=InnoDB AUTO_INCREMENT=131071 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

• Size
mysql> select count(*) from goods_characteristics;
+----------+
| count(*) |
+----------+
| 131072 |
+----------+
mysql> select count(*) from goods_shops;
+----------+
| count(*) |
+----------+
| 65536 |
+----------+

• Data Distribution: goods_characteristics
mysql> select count(*) num_rows, good_id, size
-> from goods_characteristics group by good_id, size;
+----------+---------+------+
| num_rows | good_id | size |
+----------+---------+------+
| 65536 | laptop | 7 |
| 8187 | laptop | 8 |
| 8190 | laptop | 9 |
| 8188 | laptop | 10 |
| 8192 | laptop | 11 |
| 8189 | laptop | 12 |
| 8189 | laptop | 13 |
| 8191 | laptop | 14 |
| 8190 | laptop | 15 |
| 10 | laptop | 16 |
| 10 | laptop | 17 |
+----------+---------+------+
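
To see how skewed this distribution is, the counts above can be compared with what a uniform assumption would predict. The following Python sketch copies the row counts from the output above; treating every distinct size as equally frequent is a simplifying assumption made for illustration, not a description of how the server computes its estimates.

```python
# Row counts per size, copied from the GROUP BY output above (good_id = 'laptop').
rows_per_size = {
    7: 65536, 8: 8187, 9: 8190, 10: 8188, 11: 8192, 12: 8189,
    13: 8189, 14: 8191, 15: 8190, 16: 10, 17: 10,
}

total_rows = sum(rows_per_size.values())

# Share of rows that really satisfy "size < 12".
actual_share = sum(n for size, n in rows_per_size.items() if size < 12) / total_rows

# Share predicted if every distinct size held the same number of rows
# (a simplification used only for this illustration).
uniform_share = sum(1 for size in rows_per_size if size < 12) / len(rows_per_size)

print(total_rows)
print(f"actual:  {actual_share:.2%}")
print(f"uniform: {uniform_share:.2%}")
```

A single value (size 7) holds half of the table, which is the kind of skew that an average over distinct values cannot express.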

Histogram statistics are useful primarily for nonindexed columns. Adding an index to a column for which histogram statistics are applicable might also help the optimizer make row estimates. The tradeoffs are:
• An index must be updated when table data is modified.
• A histogram is created or updated only on demand, so it adds no overhead when table data is modified. On the other hand, the statistics become progressively more out of date when table modifications occur, until the next time they are updated.
Optimizer Statistics aka Histograms

mysql> alter table goods_characteristics stats_sample_pages=5000;
Records: 0 Duplicates: 0 Warnings: 0
mysql> alter table goods_shops stats_sample_pages=5000;
Records: 0 Duplicates: 0 Warnings: 0
mysql> analyze table goods_characteristics, goods_shops;
+----------------------------+---------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+----------------------------+---------+----------+----------+
| test.goods_characteristics | analyze | status | OK |
| test.goods_shops | analyze | status | OK |
+----------------------------+---------+----------+----------+
Index Statistics Are More than Good

• The query
mysql> select count(*) from goods_shops join goods_characteristics using (good_id)
-> where size < 12 and manufacturer in ('Lenovo', 'Dell', 'Toshiba', 'Samsung', 'Acer')
-> and (location in ('Moscow', 'Kiev') or delivery_options in ('Premium', 'Urgent'));
^C^C -- query aborted
ERROR 1317 (70100): Query execution was interrupted
Performance?

• Handlers
+----------------------------+-------------+
| Variable_name | Value |
+----------------------------+-------------+
| Handler_commit | 0 |
| Handler_delete | 0 |
| Handler_discover | 0 |
| Handler_external_lock | 4 |
| Handler_mrr_init | 0 |
| Handler_prepare | 0 |
| Handler_read_first | 1 |
| Handler_read_key | 13043 |
| Handler_read_last | 0 |
| Handler_read_next | 854,767,916 |
...

• Table order
mysql> explain select count(*) from goods_shops join goods_characteristics using (good_id)
+----+-----------------------+-------+---------+--------+----------+--------------------------+
| id | table | type | key | rows | filtered | Extra |
+----+-----------------------+-------+---------+--------+----------+--------------------------+
| 1 | goods_characteristics | index | good_id | 131072 | 25.00 | Using where; Using index |
| 1 | goods_shops | ref | good_id | 65536 | 36.00 | Using where; Using index |
+----+-----------------------+-------+---------+--------+----------+--------------------------+

• Table order matters
mysql> explain select count(*) from goods_shops straight_join goods_characteristics
-> using (good_id)
+----+-----------------------+-------+---------+--------+----------+--------------------------+
| id | table | type | key | rows | filtered | Extra |
+----+-----------------------+-------+---------+--------+----------+--------------------------+
| 1 | goods_shops | index | good_id | 65536 | 36.00 | Using where; Using index |
| 1 | goods_characteristics | ref | good_id | 131072 | 25.00 | Using where; Using index |
+----+-----------------------+-------+---------+--------+----------+--------------------------+

• Table order matters
mysql> select count(*) from goods_shops straight_join goods_characteristics using (good_id)
+----------+
| count(*) |
+----------+
| 816640 |
+----------+
mysql> show status like 'Handler_read_next';
+-------------------+-----------+
+-------------------+-----------+
+-------------------+-----------+

mysql> analyze table goods_shops update histogram on location, delivery_options;
+-------------+-----------+----------+-----------------------------------------------------+
| Table | Op | Msg_type | Msg_text |
+-------------+-----------+----------+-----------------------------------------------------+
| goods_shops | histogram | status | Histogram statistics created... 'delivery_options'. |
| goods_shops | histogram | status | Histogram statistics created for column 'location'. |
+-------------+-----------+----------+-----------------------------------------------------+
mysql> analyze table goods_characteristics update histogram on size, manufacturer ;
+-----------------------+-----------+----------+-------------------------------------------------+
| Table | Op | Msg_type | Msg_text |
+-----------------------+-----------+----------+-------------------------------------------------+
| goods_characteristics | histogram | status | Histogram statistics created... 'manufacturer'. |
| goods_characteristics | histogram | status | Histogram statistics created for column 'size'. |
+-----------------------+-----------+----------+-------------------------------------------------+
Histograms to the Rescue

• The query
mysql> select count(*) from goods_shops join goods_characteristics using (good_id)
+----------+
| count(*) |
+----------+
| 816640 |
+----------+
mysql> show status like 'Handler_read_next';
+-------------------+-----------+
+-------------------+-----------+
+-------------------+-----------+

• Filtering effect
mysql> explain select count(*) from goods_shops join goods_characteristics using (good_id) where s
+----+-----------------------+-------+---------+--------+----------+--------------------------+
| id | table | type | key | rows | filtered | Extra |
+----+-----------------------+-------+---------+--------+----------+--------------------------+
| 1 | goods_shops | index | good_id | 65536 | 0.06 | Using where; Using index |
| 1 | goods_characteristics | ref | good_id | 131072 | 15.63 | Using where; Using index |
+----+-----------------------+-------+---------+--------+----------+--------------------------+

[Chart] Indexes: Number of Items with Same Value (x-axis: 1 to 10; y-axis: 0 to 800)

[Chart] Indexes: Cardinality (x-axis: 1 to 10; y-axis: 0 to 800)

[Chart] Histograms: Number of Values in Each Bucket (x-axis: 1 to 10; y-axis: 0 to 800)

[Chart] Histograms: Data in the Histogram (x-axis: 1 to 10; y-axis: 0 to 1)

↓ sql/sql_planner.cc
↓ calculate_condition_filter
↓ Item_func_*::get_filtering_effect
• get_histogram_selectivity
• Seen as a percent of filtered rows in EXPLAIN
Low Level

• Example data
mysql> create table example(f1 int) engine=innodb;
mysql> insert into example values(1),(1),(1),(2),(3);
mysql> select f1, count(f1) from example group by f1;
+------+-----------+
| f1 | count(f1) |
+------+-----------+
| 1 | 3 |
| 2 | 1 |
| 3 | 1 |
+------+-----------+
Filtered Rows

• Without a histogram
mysql> explain select * from example where f1 > 0\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: example
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 5
filtered: 33.33
Extra: Using where
1 row in set, 1 warning (0.00 sec)

*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: example
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 5
filtered: 33.33
Extra: Using where

• With the histogram
mysql> analyze table example update histogram on f1 with 3 buckets;
+-----------------+-----------+----------+------------------------------------------------+
| Table | Op | Msg_type | Msg_text |
+-----------------+-----------+----------+------------------------------------------------+
| hist_ex.example | histogram | status | Histogram statistics created for column 'f1'. |
+-----------------+-----------+----------+------------------------------------------------+

mysql> select * from information_schema.column_statistics
-> where table_name='example'\G
*************************** 1. row ***************************
SCHEMA_NAME: hist_ex
TABLE_NAME: example
COLUMN_NAME: f1
HISTOGRAM: {
"buckets": [[1, 0.6], [2, 0.8], [3, 1.0]],
"data-type": "int", "null-values": 0.0, "collation-id": 8,
"last-updated": "2018-11-07 09:07:19.791470",
"sampling-rate": 1.0, "histogram-type": "singleton",
"number-of-buckets-specified": 3
}
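
The bucket values in this output are cumulative frequencies: each pair holds a value and the fraction of rows that are less than or equal to it. A minimal Python sketch, assuming a full scan with no NULLs and no sampling, reproduces them from the example data. The function name is hypothetical and this is not the server's implementation.

```python
from collections import Counter


def singleton_histogram(values):
    """Build [value, cumulative_frequency] pairs, one bucket per distinct value."""
    counts = Counter(values)
    total = len(values)
    buckets = []
    running = 0
    for value in sorted(counts):
        running += counts[value]
        buckets.append([value, running / total])
    return buckets


# Same rows as inserted into the example table: (1),(1),(1),(2),(3).
buckets = singleton_histogram([1, 1, 1, 2, 3])
print(buckets)  # [[1, 0.6], [2, 0.8], [3, 1.0]]
```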

*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: example
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 5
filtered: 100.00 -- all rows
Extra: Using where

*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: example
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 5
filtered: 40.00 -- 2 rows
Extra: Using where

*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: example
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 5
filtered: 20.00 -- one row
Extra: Using where

*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: example
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 5
filtered: 20.00 -- one row
Extra: Using where
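
The filtered values above can be derived from the cumulative bucket frequencies of the singleton histogram. The Python sketch below shows one way to do that arithmetic. The predicates used (f1 > 0, f1 > 1, f1 > 2, f1 = 3) are assumptions chosen to match the 100%, 40% and 20% results, and the helper names are hypothetical.

```python
import bisect

# Buckets as stored for column f1: [value, cumulative frequency].
BUCKETS = [[1, 0.6], [2, 0.8], [3, 1.0]]


def cumulative_at(buckets, value):
    """Fraction of rows whose column value is <= value."""
    keys = [bucket[0] for bucket in buckets]
    position = bisect.bisect_right(keys, value)
    return buckets[position - 1][1] if position else 0.0


def selectivity_greater(buckets, value):
    """Estimated fraction of rows matching 'column > value'."""
    return 1.0 - cumulative_at(buckets, value)


def selectivity_equal(buckets, value):
    """Estimated fraction of rows matching 'column = value'."""
    previous = 0.0
    for bucket_value, cumulative in buckets:
        if bucket_value == value:
            return cumulative - previous
        previous = cumulative
    return 0.0


def as_filtered(selectivity):
    """Express a selectivity as a percentage, as in the EXPLAIN filtered column."""
    return round(selectivity * 100, 2)


print(as_filtered(selectivity_greater(BUCKETS, 0)))  # 100.0
print(as_filtered(selectivity_greater(BUCKETS, 1)))  # 40.0
print(as_filtered(selectivity_greater(BUCKETS, 2)))  # 20.0
print(as_filtered(selectivity_equal(BUCKETS, 3)))    # 20.0
```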

• CREATE INDEX
• Metadata lock
• Can be blocked by any query
• UPDATE HISTOGRAM
• Backup lock
• Can be blocked only by a backup
• Can be created any time without fear
Locking

• Helps if query plan can be changed
• Not a replacement for the index:
• GROUP BY
• ORDER BY
• Query on a single table ∗
Outcome

• Data distribution is uniform
• Range optimization can be used
• Full table scan is fast
When Are Histograms Not Helpful?

• Index statistics are collected by the engine
• Optimizer calculates cardinality each time it accesses statistics
• Indexes do not always improve performance
• Histograms can help
• Still a new feature
• Histograms do not replace other optimizations!
Conclusion

Blog by Erik Froseth
Blog by Frederic Descamps
Talk by Oystein Grovlen @Fosdem
Talk by Sergei Petrunia @PerconaLive
WL #8707
More information

www.slideshare.net/SvetaSmirnova
twitter.com/svetsmirnova
github.com/svetasmirnova
Thank you!
