The document summarizes new features and improvements to the query optimizer in MariaDB 10.4, including:
1) New default optimizer settings that take more factors into account for condition selectivity and enable the use of histograms.
2) Faster histogram collection using Bernoulli sampling rather than analyzing the whole population.
3) Condition pushdown from HAVING clauses into WHERE clauses and into materialized IN subqueries under certain conditions.
4) Building in-memory primary key filters from range index scans to filter joins more efficiently.
1. Query optimizer: further down the rabbit hole
Sergei Petrunia, Sr. Software Engineer, MariaDB Corporation
Galina Shalygina, Junior Engineer, MariaDB Corporation
2. Query Optimizer in MariaDB 10.4
● New default optimizer settings
● Faster histogram collection
● Condition pushdown:
○ into materialized IN subqueries
○ from HAVING into WHERE
● In-memory PK filters built from range index scans
● Optimizer trace
4. New default settings
● Condition selectivity computation takes more factors into account
-optimizer_use_condition_selectivity=1
+optimizer_use_condition_selectivity=4
− Better query plans
-use_stat_tables=NEVER
+use_stat_tables=PREFERABLY_FOR_QUERIES
− Still need to use ANALYZE TABLE ... PERSISTENT to collect them
● ANALYZE PERSISTENT will build a good histogram
● Optimizer uses EITS statistics (incl. histograms) if they are present
-histogram_size=0
+histogram_size=254
-histogram_type=SINGLE_PREC_HB
+histogram_type=DOUBLE_PREC_HB
− Different rows / filtered in EXPLAIN output
5. New default settings (2)
● Large IN-lists use index statistics (cardinality) as estimate
-eq_range_index_dive_limit=10
+eq_range_index_dive_limit=200
− Estimation of WHERE t.key IN (1,2,3,...,201) will not do 201 index dives
● will use AVG(records_per_key(t.key))
− Just following MySQL here
● Join buffer size will auto-size itself
-optimize_join_buffer_size=OFF
+optimize_join_buffer_size=ON
− join_buffer_size setting is still relevant
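The switch between index dives and index statistics for large IN-lists can be sketched as a toy model. This is not server code: `estimate_in_list_rows`, `counting_dive`, and all numbers are hypothetical, and the threshold rule (statistics once the list has at least `dive_limit` values, 0 meaning "always dive") is an assumption based on how MySQL documents this variable.

```python
from typing import Callable, Sequence


def estimate_in_list_rows(values: Sequence[int],
                          index_dive: Callable[[int], int],
                          avg_records_per_key: float,
                          dive_limit: int = 200) -> float:
    """Toy row estimate for `key IN (values)`."""
    # Assumption: a limit of 0 means "always dive"; otherwise statistics
    # take over once the list has at least `dive_limit` values.
    if dive_limit == 0 or len(values) < dive_limit:
        return float(sum(index_dive(v) for v in values))
    return len(values) * avg_records_per_key


dives = []


def counting_dive(value: int) -> int:
    """Hypothetical index dive: records the probe, reports 3 rows."""
    dives.append(value)
    return 3


short_estimate = estimate_in_list_rows(range(1, 11), counting_dive, 2.5)
dives_for_short = len(dives)
long_estimate = estimate_in_list_rows(range(1, 202), counting_dive, 2.5)
dives_for_long = len(dives) - dives_for_short
```

In this model the 10-value list costs 10 dives, while the 201-value list costs none and is estimated as 201 times the average records per key.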
7. Histograms
● Available since MariaDB 10.0 (Yes)
● Used by advanced users
● Have shortcomings:
− Expensive to collect
− Usage is not enabled by default
● => Not used when they should be
8. Histogram collection
● Analyzes the whole population (“census”, not “survey”)
1. Reads all data
2. Performs an expensive computation
● MariaDB 10.4 supports “Bernoulli sampling”
− Still does #1, but fixes #2
● Configuration:
− analyze_sample_percentage=100 (default) – use all data, as before
− analyze_sample_percentage=0 – determine sample ratio automatically
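A minimal sketch of the idea behind Bernoulli sampling: every row is still read, but each one is kept independently with a fixed probability, so the expensive histogram computation runs on a much smaller set. The function names and the height-balanced bucket layout are assumptions made for illustration; the automatic ratio chosen by analyze_sample_percentage=0 is not modelled.

```python
import random


def bernoulli_sample(rows, percentage, seed=None):
    """Keep each row independently with probability percentage/100."""
    rng = random.Random(seed)
    p = percentage / 100.0
    return [row for row in rows if rng.random() < p]


def equi_height_bounds(values, buckets):
    """Upper bounds of `buckets` roughly equal-sized buckets."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return []
    return [ordered[max(0, (i + 1) * n // buckets - 1)] for i in range(buckets)]


rows = list(range(100_000))                    # stand-in for a column's values
sample = bernoulli_sample(rows, 10, seed=42)   # roughly 10% of the rows
full_bounds = equi_height_bounds(rows, 4)      # "census"
sample_bounds = equi_height_bounds(sample, 4)  # "survey"
```

The bucket bounds computed from the sample should land close to those computed from the full data, while sorting only about a tenth of the values.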
9. Histogram use by the optimizer
● Now enabled by default
● The workflow:
set analyze_sample_percentage=0; -- Optional
-- Either collect statistics for selected columns and indexes:
analyze table t1
persistent for columns (col1, ...)
indexes (idx1, ...);
-- or for all columns and indexes:
analyze table t1
persistent for all;
-- Your queries here
● More details: “How to use histograms to get better performance”
− Today, 1:30 pm- 2:20 pm, Gallery C
10. MariaDB condition pushdown
MariaDB 10.2:
1. Pushdown conditions into non-mergeable views/derived tables
MariaDB 10.4:
1. Condition pushdown from HAVING into WHERE
2. Push conditions into materialized IN subqueries
12. When it can be used
● HAVING contains a condition that depends only on grouping fields
● There are no aggregate functions in this condition
● The optimizer_switch flag ‘condition_pushdown_from_having’ is on (the default)
SELECT c_name,MAX(o_totalprice)
FROM customer, orders
WHERE o_custkey = c_custkey
GROUP BY c_name
HAVING c_name = 'Customer#000000020';
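The applicability rules can be sketched as a small classifier over the conjuncts of a HAVING clause. This is only an illustration of the rules on this slide, not the optimizer's implementation: the `Conjunct` type and `split_having` are hypothetical, and the second conjunct (`MAX(o_totalprice) > 1000`) is invented to show a condition that must stay in HAVING.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conjunct:
    text: str              # the condition as written
    columns: frozenset     # columns the condition refers to
    uses_aggregate: bool   # True if it contains MAX(), MIN(), ...


def split_having(conjuncts, group_by):
    """Split HAVING conjuncts into (pushable to WHERE, must stay in HAVING)."""
    group_fields = frozenset(group_by)
    pushable, remaining = [], []
    for c in conjuncts:
        # Pushable: no aggregates and only grouping fields are referenced.
        if not c.uses_aggregate and c.columns <= group_fields:
            pushable.append(c.text)
        else:
            remaining.append(c.text)
    return pushable, remaining


having = [
    Conjunct("c_name = 'Customer#000000020'", frozenset({"c_name"}), False),
    # Hypothetical extra condition, not part of the slide's query:
    Conjunct("MAX(o_totalprice) > 1000", frozenset({"o_totalprice"}), True),
]
to_where, stays_in_having = split_having(having, ["c_name"])
```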
13. How it is made
SELECT c_name,MAX(o_totalprice)
FROM customer, orders
WHERE o_custkey = c_custkey
GROUP BY c_name
HAVING c_name = 'Customer#000000020';
is rewritten into:
SELECT c_name,MAX(o_totalprice)
FROM customer, orders
WHERE o_custkey = c_custkey AND
c_name = 'Customer#000000020'
GROUP BY c_name;
● No temporary table
● No sorting
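The equivalence of the HAVING form and the WHERE form can be checked on a toy data set. The sketch below uses Python's built-in SQLite purely as a stand-in for MariaDB; the table definitions and rows are made up, and SQLite does not perform this rewrite itself, so the example only confirms that both forms return the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_name TEXT);
    CREATE TABLE orders (o_orderkey INTEGER PRIMARY KEY,
                         o_custkey INTEGER, o_totalprice REAL);
    INSERT INTO customer VALUES (19, 'Customer#000000019'),
                                (20, 'Customer#000000020');
    INSERT INTO orders VALUES (1, 19, 500.0), (2, 20, 100.0), (3, 20, 250.0);
""")

# Original form: the filter is applied after grouping.
having_form = """
    SELECT c_name, MAX(o_totalprice)
    FROM customer, orders
    WHERE o_custkey = c_custkey
    GROUP BY c_name
    HAVING c_name = 'Customer#000000020'
"""
# Pushed-down form: the filter is applied before grouping.
where_form = """
    SELECT c_name, MAX(o_totalprice)
    FROM customer, orders
    WHERE o_custkey = c_custkey AND c_name = 'Customer#000000020'
    GROUP BY c_name
"""
having_rows = conn.execute(having_form).fetchall()
where_rows = conn.execute(where_form).fetchall()
```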
17. Pushing down using equalities
SELECT l_shipdate,l_receiptdate,MAX(l_quantity)
FROM lineitem
GROUP BY l_shipdate
HAVING l_receiptdate > '1996-11-01' AND
l_shipdate = l_receiptdate;
SELECT l_shipdate,l_receiptdate,MAX(l_quantity)
FROM lineitem
WHERE l_shipdate > '1996-11-01' AND
l_shipdate = l_receiptdate
GROUP BY l_shipdate;
19. Where you can find it
MariaDB 10.4 MySQL 8.0 PostgreSQL 11.2 Oracle 12c
GROUP BY t1.a
HAVING (t1.a=t1.c) AND (t1.c>1);
PostgreSQL will not allow it
23. When it can be used
● Uncorrelated materialized semi-join IN-subquery with GROUP BY
● There is a condition whose fields all appear in the left part of the IN predicate
● The optimizer_switch flag ‘condition_pushdown_for_subquery’ is on (the default)
SELECT c_name,c_phone
FROM customer
WHERE c_status = 1 AND
c_regyear BETWEEN 1992 AND 1994 AND
(c_custkey,c_status,c_regyear) IN
(
SELECT o_custkey,MIN(o_customerstatus),
o_orderyear
FROM orders
GROUP BY o_custkey,o_orderyear
)
;
24. How condition pushdown is made
SELECT c_name,c_phone
FROM customer
WHERE c_status = 1 AND
c_regyear BETWEEN 1992 AND 1994 AND
(c_custkey,c_status,c_regyear) IN
(
SELECT o_custkey,MIN(o_customerstatus),o_orderyear
FROM orders
GROUP BY o_custkey,o_orderyear
)
;
25. How condition pushdown is made
SELECT c_name,c_phone
FROM customer
WHERE c_status = 1 AND
c_regyear BETWEEN 1992 AND 1994 AND
(c_custkey,c_status,c_regyear) IN
(
SELECT o_custkey,MIN(o_customerstatus),o_orderyear
FROM orders
GROUP BY o_custkey,o_orderyear
HAVING MIN(o_customerstatus) = 1
)
;
27. How condition pushdown is made
SELECT c_name,c_phone
FROM customer
WHERE c_status = 1 AND
c_regyear BETWEEN 1992 AND 1994 AND
(c_custkey,c_status,c_regyear) IN
(
SELECT o_custkey,MIN(o_customerstatus),o_orderyear
FROM orders
WHERE o_orderyear BETWEEN 1992 AND 1994
GROUP BY o_custkey,o_orderyear
HAVING MIN(o_customerstatus) = 1
)
;
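The effect of pushing the outer conditions into the subquery can be illustrated on toy data. SQLite (via Python) is used here only as a stand-in for MariaDB, and the schema and rows are invented. The sketch checks two things: the outer query returns the same rows with and without the pushed conditions, and the subquery alone produces fewer rows to materialize once they are pushed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_name TEXT,
                           c_phone TEXT, c_status INTEGER, c_regyear INTEGER);
    CREATE TABLE orders (o_orderkey INTEGER PRIMARY KEY, o_custkey INTEGER,
                         o_customerstatus INTEGER, o_orderyear INTEGER);
    INSERT INTO customer VALUES (1, 'Alice', '111', 1, 1993),
                                (2, 'Bob',   '222', 1, 1990),
                                (3, 'Carol', '333', 2, 1993),
                                (4, 'Dave',  '444', 1, 1994);
    INSERT INTO orders VALUES (1, 1, 1, 1993), (2, 1, 3, 1993),
                              (3, 2, 1, 1990), (4, 3, 2, 1993),
                              (5, 4, 2, 1994);
""")

plain_subquery = """
    SELECT o_custkey, MIN(o_customerstatus), o_orderyear
    FROM orders
    GROUP BY o_custkey, o_orderyear
"""
pushed_subquery = """
    SELECT o_custkey, MIN(o_customerstatus), o_orderyear
    FROM orders
    WHERE o_orderyear BETWEEN 1992 AND 1994
    GROUP BY o_custkey, o_orderyear
    HAVING MIN(o_customerstatus) = 1
"""
outer_template = """
    SELECT c_name, c_phone
    FROM customer
    WHERE c_status = 1 AND
          c_regyear BETWEEN 1992 AND 1994 AND
          (c_custkey, c_status, c_regyear) IN ({subquery})
    ORDER BY c_custkey
"""

plain_result = conn.execute(outer_template.format(subquery=plain_subquery)).fetchall()
pushed_result = conn.execute(outer_template.format(subquery=pushed_subquery)).fetchall()
plain_materialized = len(conn.execute(plain_subquery).fetchall())
pushed_materialized = len(conn.execute(pushed_subquery).fetchall())
```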
34. What is PK-filter
SELECT o_orderkey, l_linenumber, l_shipdate, o_totalprice
FROM lineitem JOIN orders ON l_orderkey = o_orderkey
WHERE l_shipdate BETWEEN '1997-01-01' AND '1997-06-30' AND
o_totalprice between 200000 and 230000; -- condition C1
1. There is an index i_o_totalprice on orders(o_totalprice)
2. The number of rows satisfying C1 is small compared with the cardinality of orders
Try to build a filter!
38. What is PK-filter
o_totalprice between 200000 and 230000 + i_o_totalprice =
range scan using i_o_totalprice
1. Collect the primary keys of the rows in this range (PK for C1)
2. Sort them
Result: PK filter built from range index scan
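The build steps above can be sketched in a few lines. This is a simplified model, not the server's data structure: the secondary index is represented as a sorted list of (o_totalprice, primary key) pairs, the filter as a sorted list probed by binary search, and all names and values are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_pk_filter(secondary_index, low, high):
    """Range-scan a secondary index and return the sorted primary keys.

    secondary_index: list of (key_value, primary_key), sorted by key_value.
    """
    keys = [key for key, _ in secondary_index]
    start = bisect_left(keys, low)      # first entry with key >= low
    end = bisect_right(keys, high)      # one past the last entry with key <= high
    return sorted(pk for _, pk in secondary_index[start:end])


def pk_in_filter(pk_filter, pk):
    """Binary-search membership test against the sorted filter."""
    pos = bisect_left(pk_filter, pk)
    return pos < len(pk_filter) and pk_filter[pos] == pk


# Toy i_o_totalprice index: (o_totalprice, o_orderkey)
i_o_totalprice = [(150000.0, 7), (205000.0, 3), (210000.0, 9),
                  (229000.0, 1), (240000.0, 4)]
pk_filter = build_pk_filter(i_o_totalprice, 200000, 230000)

# Join side: l_orderkey values coming from lineitem. Only keys that pass
# the filter would need a lookup in orders.
l_orderkeys = [1, 4, 7, 9]
surviving = [key for key in l_orderkeys if pk_in_filter(pk_filter, key)]
```

The point of the filter is that rows failing the o_totalprice condition are rejected by a cheap in-memory check before the orders table is touched.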
40. How it works
+------+-------------+----------+---------------+---------------------------------------------------------+------------------------+---------+--------------------------------+--------+---------------------------------+
| id   | select_type | table    | type          | possible_keys                                           | key                    | key_len | ref                            | rows   | Extra                           |
+------+-------------+----------+---------------+---------------------------------------------------------+------------------------+---------+--------------------------------+--------+---------------------------------+
|    1 | SIMPLE      | lineitem | range         | PRIMARY,i_l_shipdate,i_l_orderkey,i_l_orderkey_quantity | i_l_shipdate           | 4       | NULL                           | 98     | Using index condition           |
|    1 | SIMPLE      | orders   | eq_ref|filter | PRIMARY,i_o_totalprice                                  | PRIMARY|i_o_totalprice | 4|9     | dbt3_small.lineitem.l_orderkey | 1 (5%) | Using where; Using rowid filter |
+------+-------------+----------+---------------+---------------------------------------------------------+------------------------+---------+--------------------------------+--------+---------------------------------+
41. How it works
"rowid_filter": {
"range": {
"key": "i_o_totalprice",
"used_key_parts": ["o_totalprice"]
},
"rows": 81,
"selectivity_pct": 5.4,
"r_rows": 71,
"r_selectivity_pct": 10.417,
"r_buffer_size": 53,
"r_filling_time_ms": 0.0482
},
42. Limitations
● The ‘rowid_filter’ optimizer_switch flag must be on (it is by default)
● PK-filter size shouldn’t exceed ‘max_rowid_filter_size’
○ 128 KB by default
● The index the filter is built on must not be a clustered primary key
● Engines that support rowid filters
○ InnoDB
○ MyISAM
Find more in MDEV-16188
44. Improvement: MyISAM
SELECT l_quantity, l_shipdate
FROM lineitem, orders
WHERE l_orderkey=o_orderkey AND
o_totalprice BETWEEN 300000 AND 330000 AND
l_shipdate BETWEEN '1996-11-01' AND '1996-11-14' AND
l_quantity=15;
Indexes on:
1. l_quantity
2. l_shipdate
3. o_totalprice
max_rowid_filter_size = 24 MB
46. Improvement: MyISAM
DBT3          with optimization    without optimization
5 GB          1.147 sec            15.005 sec
15 GB         3.281 sec            44.363 sec
30 GB: SSD    6.552 sec            1 min 28.347 sec
30 GB: HDD    37.234 sec           7 min 29.090 sec
47. Improvement: InnoDB
SELECT *
FROM part,lineitem,partsupp
WHERE p_partkey = ps_partkey AND
l_suppkey=ps_suppkey AND
p_retailprice BETWEEN 1080 AND 1100 AND
l_shipdate BETWEEN '1996-10-01' AND '1997-02-01';
51. Optimizer trace
● Available in MySQL since MySQL 5.6
mysql> set optimizer_trace=1;
mysql> <query>;
mysql> select * from
-> information_schema.optimizer_trace;
"steps": [
  {
    "join_preparation": {
      "select#": 1,
      "steps": [
        {
          "expanded_query": "/* select#1 */ select `t1`.`col1` AS `col1`,`t1`.`col2` AS `col2` from `t1` where (`t1`.`col1` < 4)"
        }
      ]
    }
  },
  {
    "join_optimization": {
      "select#": 1,
      "steps": [
        {
          "condition_processing": {
            "condition": "WHERE",
            "original_condition": "(`t1`.`col1` < 4)",
            "steps": [
              {
                "transformation": "equality_propagation",
                "resulting_condition": "(`t1`.`col1` < 4)"
              },
              {
                "transformation": "constant_propagation",
                "resulting_condition": "(`t1`.`col1` < 4)"
              },
              {
                "transformation": "trivial_condition_removal",
                "resulting_condition": "(`t1`.`col1` < 4)"
              }
            ]
          }
● Now, a similar feature in MariaDB
● Explains optimizer choices
52. The goal is to understand the optimizer
● “Why was query plan X not chosen?”
− It had higher cost (due to incorrect statistics ?)
− Limitation in the optimizer?
● What rewrites happen
− Does “X=10 AND FUNC(X)” -> “FUNC(10)” work?
− Or any other suspicious rewrite of the day
● What changed between the two hosts/versions
− diff /tmp/trace_from_host1.json /tmp/trace_from_host2.json
● ...
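Comparing two traces is easier when both are normalized first. A minimal sketch, assuming each trace has already been loaded into a Python dict (for example with `json.load`); `diff_traces` is a hypothetical helper, and the two sample fragments are modelled on the view example later in this deck.

```python
import difflib
import json


def diff_traces(trace_a, trace_b, name_a="host1", name_b="host2"):
    """Return unified-diff lines between two optimizer traces."""
    # Sorting keys and using a fixed indent keeps the diff free of noise
    # caused by key order or whitespace.
    text_a = json.dumps(trace_a, indent=2, sort_keys=True).splitlines()
    text_b = json.dumps(trace_b, indent=2, sort_keys=True).splitlines()
    return list(difflib.unified_diff(text_a, text_b,
                                     fromfile=name_a, tofile=name_b,
                                     lineterm=""))


trace_before = {"view": {"table": "view_name_8", "select_id": 9,
                         "algorithm": "merged"}}
trace_after = {"view": {"table": "view_name_8", "select_id": 9,
                        "algorithm": "materialized",
                        "cause": "Not enough table bits to merge subquery"}}

diff_lines = diff_traces(trace_before, trace_after)
# Keep only real changes, skipping the "---" / "+++" file header lines.
changed = [line for line in diff_lines
           if line.startswith(("+", "-"))
           and not line.startswith(("+++", "---"))]
```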
53. A user case: range optimizer
● Complex WHERE clause and multi-component index make it unclear what ranges will be scanned
● A classic example:
create table some_events (
  start_date DATE,
  end_date DATE,
  ...
  KEY (start_date, end_date)
);
select ...
from some_events
where
  start_date >= '2019-02-10' and
  end_date <= '2019-04-01'
"rows_estimation": [
  {
    "table": "some_events",
    ...
    "analyzing_range_alternatives": {
      "range_scan_alternatives": [
        {
          "index": "start_date",
          "ranges": ["0x4ac60f <= start_date"],
          "rowid_ordered": false,
          "using_mrr": false,
          "index_only": false,
          ..
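The trace reports a single lower bound on start_date, which matches how a two-part index behaves when the first key part is only bounded by an inequality. A toy model, assuming the index can be represented as a sorted list of (start_date, end_date) tuples; `range_scan` and the sample rows are hypothetical.

```python
from bisect import bisect_left
from datetime import date


def range_scan(index_rows, start_from, end_until):
    """Scan KEY(start_date, end_date) for start_date >= X AND end_date <= Y.

    index_rows: (start_date, end_date) tuples in index order.
    Returns (entries the scan has to read, entries that match).
    """
    # Only the first key part bounds the scan: jump to the first entry with
    # start_date >= start_from, then read to the end of the index.
    first = bisect_left(index_rows, (start_from,))
    scanned = index_rows[first:]
    # The end_date condition cannot shrink the range; it is checked per entry.
    matching = [row for row in scanned if row[1] <= end_until]
    return scanned, matching


index_rows = [
    (date(2019, 1, 5), date(2019, 1, 20)),
    (date(2019, 2, 10), date(2019, 3, 1)),
    (date(2019, 2, 15), date(2019, 5, 1)),
    (date(2019, 3, 1), date(2019, 3, 15)),
    (date(2019, 6, 1), date(2019, 6, 10)),
]
scanned, matching = range_scan(index_rows, date(2019, 2, 10), date(2019, 4, 1))
```

Here four index entries are read to find two matches, which is the kind of gap between "ranges scanned" and "rows returned" that the trace makes visible.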
54. Customer Case: a VIEW that stopped merging
● A big join query with lots of nested views
● Performance drop after a minor change to a VIEW
− EXPLAIN shows the view is no longer merged
● Initial idea: the change added a LEFT JOIN, so that must be the cause
Before:
"view": {
  "table": "view_name_8",
  "select_id": 9,
  "algorithm": "merged"
}
After:
"view": {
  "table": "view_name_8",
  "select_id": 9,
  "algorithm": "materialized",
  "cause": "Not enough table bits to merge subquery"
}
● (Due to Table Elimination, EXPLAIN showed <64 tables both before and after)
55. Customer Case 2: no materialization
● Subquery materialization was not used.
+------+--------------------+-------+------+---------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+--------------------+-------+------+---------------+------+---------+------+---------+-------------+
| 1 | PRIMARY | t1 | ALL | NULL | NULL | NULL | NULL | 10 | Using where |
| 2 | DEPENDENT SUBQUERY | t2 | ALL | NULL | NULL | NULL | NULL | 1000000 | Using where |
+------+--------------------+-------+------+---------------+------+---------+------+---------+-------------+
"join_preparation": {
"select_id": 2,
"steps": [
{
"transformation": {
"select_id": 2,
"from": "IN (SELECT)",
"to": "materialization",
"possible": false,
"cause": "types mismatch"
}
● Different datatypes disallow Materialization
● A non-obvious limitation
− Required a server developer with a debugger to figure out
select * from t1 where t1.col in (select t2.col from t2) or ...
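In a large trace, the interesting part is often just the "cause" entries. A small sketch that walks a trace loaded as a Python dict and lists every "cause" together with its location; `find_causes` is a hypothetical helper, and the sample trace is the fragment above with its closing brackets filled in for the sake of the example.

```python
def find_causes(node, path=""):
    """Recursively collect (path, cause) pairs from a parsed trace."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "cause":
                found.append((path, value))
            else:
                child = f"{path}.{key}" if path else key
                found.extend(find_causes(value, child))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            found.extend(find_causes(item, f"{path}[{i}]"))
    return found


trace = {
    "join_preparation": {
        "select_id": 2,
        "steps": [
            {
                "transformation": {
                    "select_id": 2,
                    "from": "IN (SELECT)",
                    "to": "materialization",
                    "possible": False,
                    "cause": "types mismatch",
                }
            }
        ],
    }
}
causes = find_causes(trace)
```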
57. Optimizer trace summary
● Lets you examine how the optimizer processes a query
● Mostly for manual troubleshooting
● Good for bug reporting too
● Currently prints the essentials
− Will print more in the future.