Query optimizer: further down the rabbit hole

Sergei Petrunia, Sr. Software Engineer, MariaDB Corporation
Galina Shalygina, Junior Engineer, MariaDB Corporation

Query Optimizer in MariaDB 10.4
● New default optimizer settings
● Faster histogram collection
● Condition pushdown:
○ into materialized IN subqueries
○ from HAVING into WHERE
● In-memory PK filters built from range index scans
● Optimizer trace

New default settings
● Condition selectivity computation takes more factors into account
-optimizer_use_condition_selectivity=1
+optimizer_use_condition_selectivity=4
− Better query plans
-use_stat_tables=NEVER
+use_stat_tables=PREFERABLY_FOR_QUERIES
− Still need to use ANALYZE TABLE ... PERSISTENT to collect them
● ANALYZE TABLE ... PERSISTENT will build a good histogram
● Optimizer uses EITS statistics (incl. histograms) if they are present
-histogram_size=0
+histogram_size=254
-histogram_type=SINGLE_PREC_HB
+histogram_type=DOUBLE_PREC_HB
− Different rows / filtered in EXPLAIN output

New default settings (2)
● Large IN-lists use index statistics (cardinality) as estimate
-eq_range_index_dive_limit=10
+eq_range_index_dive_limit=200
− Estimation of WHERE t.key IN (1,2,3,...,201) will not do 201 index dives
● will use AVG(records_per_key(t.key))
− Just following MySQL here
● Join buffer size will auto-size itself
-optimize_join_buffer_size=OFF
+optimize_join_buffer_size=ON
− join_buffer_size setting is still relevant
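
The large IN-list rule on this slide can be sketched in a few lines of Python. This is an illustration only: the function name estimate_in_list_rows, the dive callback and the sample numbers are hypothetical, and the threshold test follows the documented MySQL behaviour (index dives only while the number of equality ranges is below the limit, with 0 meaning "always dive"), which the slide says MariaDB follows.

```python
from typing import Callable, Sequence


def estimate_in_list_rows(values: Sequence[int],
                          dive: Callable[[int], int],
                          records_per_key: float,
                          dive_limit: int = 200) -> float:
    """Estimate the rows matched by `key IN (values)`.

    dive(v) stands in for one index dive (a per-value row count).
    records_per_key is the average number of rows per distinct key value,
    taken from index statistics.
    """
    use_dives = dive_limit == 0 or len(values) < dive_limit
    if use_dives:
        return float(sum(dive(v) for v in values))
    # Too many ranges: one multiplication instead of len(values) dives.
    return len(values) * records_per_key


dive_calls = []


def fake_dive(v: int) -> int:
    """Hypothetical skewed data: odd keys have 3 rows, even keys have 1."""
    dive_calls.append(v)
    return 3 if v % 2 else 1


small = estimate_in_list_rows(range(1, 4), fake_dive, records_per_key=2.0)
large = estimate_in_list_rows(range(1, 202), fake_dive, records_per_key=2.0)
print(small, large, len(dive_calls))
```

The short list is estimated with three dives; the 201-element list performs no dives at all and falls back to the average, which is cheaper but blind to skew.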

Histograms
● Available since MariaDB 10.0 (Yes)
● Used by advanced users
● Have shortcomings:
− Expensive to collect
− Usage is not enabled by default
● => Not used when they should be

Histogram collection
● Analyzes the whole population (“census”, not “survey”)
1. Reads all data
2. Performs an expensive computation
● MariaDB 10.4 supports “Bernoulli sampling”
− Still does #1, but fixes #2
● Configuration:
− analyze_sample_percentage=100 (default) – use all data, as before
− analyze_sample_percentage=0 – determine sample ratio automatically
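
As a rough illustration of Bernoulli sampling (not MariaDB's implementation): every row is still visited, but each one is passed on to the expensive histogram computation only with a fixed probability. The function name bernoulli_sample and the seed parameter are made up for this sketch, and the automatic mode (analyze_sample_percentage=0) is not modelled.

```python
import random


def bernoulli_sample(rows, percentage, seed=None):
    """Keep each row independently with probability percentage/100."""
    rng = random.Random(seed)
    p = percentage / 100.0
    # All rows are read (step 1); only the kept ones reach step 2.
    return [row for row in rows if rng.random() < p]


rows = list(range(100_000))
everything = bernoulli_sample(rows, 100)      # behaves like a full census
sample = bernoulli_sample(rows, 10, seed=42)  # roughly 10% of the rows
print(len(everything), len(sample))
```

The sample size is not fixed in advance; it only concentrates around percentage/100 of the table as the table grows.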

Histogram use by the optimizer
● Now enabled by default
● The workflow:
set analyze_sample_percentage=0; -- Optional
analyze table t1
persistent for columns (col1, ...)
indexes (idx1, ...);
analyze table t1
persistent for all;
-- Your queries here
● More details: “How to use histograms to get better performance”
− Today, 1:30 pm- 2:20 pm, Gallery C

MariaDB condition pushdown
MariaDB 10.2:
1. Pushdown conditions into non-mergeable views/derived tables
MariaDB 10.4:
1. Condition pushdown from HAVING into WHERE
2. Push conditions into materialized IN subqueries

Condition pushdown from HAVING into WHERE

When it can be used
● There is a condition that depends on grouping fields only in HAVING
● There are no aggregation functions in this condition
● The optimizer_switch flag ‘condition_pushdown_from_having’ is on (on by default)
SELECT c_name,MAX(o_totalprice)
FROM customer, orders
WHERE o_custkey = c_custkey
GROUP BY c_name
HAVING c_name = 'Customer#000000020';

How it is made
SELECT c_name,MAX(o_totalprice)
FROM customer, orders
WHERE o_custkey = c_custkey AND
c_name = 'Customer#000000020'
GROUP BY c_name;
● No temporary table
● No sorting
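
That moving the c_name condition from HAVING into WHERE does not change the result can be checked on toy data. The sketch below uses SQLite from the Python standard library as a stand-in, so it demonstrates only the equivalence of the two queries, not MariaDB's plans; the table contents are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_name TEXT);
    CREATE TABLE orders (o_orderkey INTEGER PRIMARY KEY,
                         o_custkey INTEGER, o_totalprice REAL);
    INSERT INTO customer VALUES (19, 'Customer#000000019'),
                                (20, 'Customer#000000020');
    INSERT INTO orders VALUES (1, 19, 100.0), (2, 20, 250.0), (3, 20, 400.0);
""")

# Condition evaluated after grouping: every group is built first.
having_sql = """
    SELECT c_name, MAX(o_totalprice)
    FROM customer, orders
    WHERE o_custkey = c_custkey
    GROUP BY c_name
    HAVING c_name = 'Customer#000000020'
"""
# Condition evaluated before grouping: only one group is ever built.
where_sql = """
    SELECT c_name, MAX(o_totalprice)
    FROM customer, orders
    WHERE o_custkey = c_custkey AND c_name = 'Customer#000000020'
    GROUP BY c_name
"""
having_rows = conn.execute(having_sql).fetchall()
where_rows = conn.execute(where_sql).fetchall()
print(having_rows, where_rows)
```

The rewrite is safe here because the condition refers only to a grouping column, so it accepts or rejects whole groups.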

Pushing down using equalities
SELECT l_shipdate,l_receiptdate,MAX(l_quantity)
FROM lineitem
GROUP BY l_shipdate
HAVING l_receiptdate > '1996-11-01' AND
l_shipdate = l_receiptdate;

Pushing down using equalities
SELECT l_shipdate,l_receiptdate,MAX(l_quantity)
FROM lineitem
WHERE l_shipdate > '1996-11-01' AND
l_shipdate = l_receiptdate
GROUP BY l_shipdate;

Where you can find it
MariaDB 10.4 MySQL 8.0 PostgreSQL 11.2 Oracle 12c
GROUP BY t1.a
HAVING (t1.a=t1.c) AND (t1.c>1);
PostgreSQL will not allow it

Condition pushdown into materialized IN subquery

When it can be used
● Uncorrelated materialized semi-join IN-subquery with GROUP BY
SELECT c_name,c_phone
FROM customer
WHERE c_status = 1 AND
c_regyear BETWEEN 1992 AND 1994 AND
(c_custkey,c_status,c_regyear) IN
(
SELECT o_custkey,MIN(o_customerstatus),
o_orderyear
FROM orders
GROUP BY o_custkey,o_orderyear
)
;

When it can be used
● There is a condition whose fields all appear in the left part of the IN-subquery predicate
− Here: c_status = 1 AND c_regyear BETWEEN 1992 AND 1994
● The optimizer_switch flag ‘condition_pushdown_for_subquery’ is on (on by default)

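How condition pushdown is made
SELECT c_name,c_phone
FROM customer
WHERE c_status = 1 AND
c_regyear BETWEEN 1992 AND 1994 AND
(c_custkey,c_status,c_regyear) IN
(
SELECT o_custkey,MIN(o_customerstatus),o_orderyear
FROM orders
WHERE o_orderyear BETWEEN 1992 AND 1994
GROUP BY o_custkey,o_orderyear
HAVING MIN(o_customerstatus) = 1
)
;
● c_regyear maps to the grouping field o_orderyear: its condition goes into the subquery's WHERE
● c_status maps to the aggregate MIN(o_customerstatus): its condition goes into the subquery's HAVING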
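
The effect of this pushdown can be tried on toy data. The sketch below again uses SQLite as a stand-in for MariaDB (it needs row-value IN support, SQLite 3.15 or later) and compares the query with and without the conditions copied into the subquery; the data and the Python variable names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_name TEXT,
                           c_phone TEXT, c_status INTEGER, c_regyear INTEGER);
    CREATE TABLE orders (o_orderkey INTEGER PRIMARY KEY, o_custkey INTEGER,
                         o_customerstatus INTEGER, o_orderyear INTEGER);
    INSERT INTO customer VALUES
        (1, 'A', '111', 1, 1993), (2, 'B', '222', 1, 1993),
        (3, 'C', '333', 1, 1990), (4, 'D', '444', 1, 1994);
    INSERT INTO orders VALUES
        (10, 1, 1, 1993), (11, 1, 2, 1993), (12, 2, 2, 1993),
        (13, 3, 1, 1990), (14, 4, 1, 1995);
""")

OUTER = """
    SELECT c_name, c_phone FROM customer
    WHERE c_status = 1 AND c_regyear BETWEEN 1992 AND 1994 AND
          (c_custkey, c_status, c_regyear) IN ({subquery})
    ORDER BY c_name
"""
plain_subquery = """
    SELECT o_custkey, MIN(o_customerstatus), o_orderyear
    FROM orders
    GROUP BY o_custkey, o_orderyear
"""
# Same subquery with the outer conditions pushed into WHERE and HAVING.
pushed_subquery = """
    SELECT o_custkey, MIN(o_customerstatus), o_orderyear
    FROM orders
    WHERE o_orderyear BETWEEN 1992 AND 1994
    GROUP BY o_custkey, o_orderyear
    HAVING MIN(o_customerstatus) = 1
"""
plain_rows = conn.execute(OUTER.format(subquery=plain_subquery)).fetchall()
pushed_rows = conn.execute(OUTER.format(subquery=pushed_subquery)).fetchall()
# Number of rows a materialized temporary table would have to hold.
plain_size = len(conn.execute(plain_subquery).fetchall())
pushed_size = len(conn.execute(pushed_subquery).fetchall())
print(plain_rows, pushed_rows, plain_size, pushed_size)
```

Both forms return the same customers, but the pushed-down subquery produces far fewer rows to materialize, which is where the speedup in the following table would come from.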

Improvement
DBT3, MyISAM   with optimization   without optimization
1 GB           0.013 sec           0.017 sec
5 GB           7.185 sec           2 min 51.705 sec
15 GB          11.003 sec          12 min 47.846 sec

In-memory PK filters built from range index scans

What is PK-filter
SELECT o_orderkey, l_linenumber, l_shipdate, o_totalprice
FROM lineitem JOIN orders ON l_orderkey = o_orderkey
WHERE l_shipdate BETWEEN '1997-01-01' AND '1997-06-30' AND
o_totalprice between 200000 and 230000;

What is PK-filter
1. There is an index i_o_totalprice on orders(o_totalprice)
2. The cardinality of C1 (the rows with o_totalprice between 200000 and 230000) is small in comparison with the cardinality of orders
Try to build a filter!
● o_totalprice between 200000 and 230000 + i_o_totalprice = range scan using i_o_totalprice
● Collect the primary keys for the rows in this range (PK for C1)
● Sort them
● Result: PK filter built from range index scan

How it works
[Diagram: orders joined with lineitem on o_orderkey = l_orderkey, with the PK-filter applied to the join]
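
A minimal sketch of the idea, assuming the filter is a sorted array of primary keys probed by binary search (the slides only say the keys are collected and sorted; the actual container is not described). All names and data are hypothetical, and the list comprehension stands in for a real range scan over i_o_totalprice.

```python
from bisect import bisect_left


def build_pk_filter(index_entries, low, high):
    """Collect and sort the primary keys of index entries inside [low, high].

    index_entries holds (o_totalprice, o_orderkey) pairs, the way a
    secondary index stores the key value together with the primary key.
    """
    return sorted(pk for price, pk in index_entries if low <= price <= high)


def pk_filter_contains(pk_filter, pk):
    """Binary search for pk in the sorted filter."""
    i = bisect_left(pk_filter, pk)
    return i < len(pk_filter) and pk_filter[i] == pk


orders_index = [(150000, 1), (210000, 2), (229000, 5), (260000, 3), (205000, 9)]
pk_filter = build_pk_filter(orders_index, 200000, 230000)

# (l_orderkey, l_linenumber) pairs driving the join.
lineitem = [(1, 1), (2, 1), (2, 2), (3, 1), (5, 1), (9, 1), (9, 2)]
full_row_reads = 0
result = []
for l_orderkey, l_linenumber in lineitem:
    if not pk_filter_contains(pk_filter, l_orderkey):
        continue  # rejected by the filter: no orders row is fetched
    full_row_reads += 1
    result.append((l_orderkey, l_linenumber))
print(pk_filter, full_row_reads)
```

The gain comes from the rejected probes: each one costs a lookup in a small in-memory structure instead of a row fetch from the table.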

How it works
"rowid_filter": {
"range": {
"key": "i_o_totalprice",
"used_key_parts": ["o_totalprice"]
},
"rows": 81,
"selectivity_pct": 5.4,
"r_rows": 71,
"r_selectivity_pct": 10.417,
"r_buffer_size": 53,
"r_filling_time_ms": 0.0482
},

Limitations
● The optimizer_switch flag ‘rowid_filter’ must be on (on by default)
● PK-filter size shouldn’t exceed ‘max_rowid_filter_size’
○ 128 KB by default
● Index on which filter is built is not clustered primary
● Engines that support rowid filters
○ InnoDB
○ MyISAM
Find more in MDEV-16188

Improvement: MyISAM
SELECT l_quantity, l_shipdate
FROM lineitem, orders
WHERE l_orderkey=o_orderkey AND
o_totalprice BETWEEN 300000 AND 330000 AND
l_shipdate BETWEEN '1996-11-01' AND '1996-11-14' AND
l_quantity=15;
Indexes on:
1. l_quantity
2. l_shipdate
3. o_totalprice
max_rowid_filter_size = 24 MB

Improvement: MyISAM
DBT3         with optimization   without optimization
5 GB         1.147 sec           15.005 sec
15 GB        3.281 sec           44.363 sec
30 GB: SSD   6.552 sec           1 min 28.347 sec
30 GB: HDD   37.234 sec          7 min 29.090 sec

Improvement: InnoDB
SELECT *
FROM part,lineitem,partsupp
WHERE p_partkey = ps_partkey AND
l_suppkey=ps_suppkey AND
p_retailprice BETWEEN 1080 AND 1100 AND
l_shipdate BETWEEN '1996-10-01' AND '1997-02-01';

Improvement: InnoDB
DBT3    with optimization   without optimization
5 GB    27.669 sec          5 min 41.049 sec
15 GB   8 min 28.506 sec    > 50 min

Optimizer trace
● Available in MySQL since MySQL 5.6
mysql> set optimizer_trace=1;
mysql> <query>;
mysql> select * from
-> information_schema.optimizer_trace;
"steps": [
{
"join_preparation": {
"select#": 1,
"steps": [
{
"expanded_query": "/* select#1 */ select `t1`.`col1` AS `col1`,`t1`.`col2` AS `col2` from `t1` where (`t1`.`col1` < 4)"
}
]
}
},
{
"join_optimization": {
"select#": 1,
"steps": [
{
"condition_processing": {
"condition": "WHERE",
"original_condition": "(`t1`.`col1` < 4)",
"steps": [
{
"transformation": "equality_propagation",
"resulting_condition": "(`t1`.`col1` < 4)"
},
{
"transformation": "constant_propagation",
"resulting_condition": "(`t1`.`col1` < 4)"
},
{
"transformation": "trivial_condition_removal",
"resulting_condition": "(`t1`.`col1` < 4)"
}
]
}
● Now, a similar feature in MariaDB
● Explains optimizer choices

The goal is to understand the optimizer
● “Why was query plan X not chosen?”
− It had higher cost (due to incorrect statistics ?)
− Limitation in the optimizer?
● What rewrites happen
− Does “X=10 AND FUNC(X)” -> “FUNC(10)” work?
− Or any other suspicious rewrite of the day
● What changed between the two hosts/versions
− diff /tmp/trace_from_host1.json /tmp/trace_from_host2.json
● ...
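
Besides a textual diff, two traces can be compared structurally. The sketch below is one possible approach, not a MariaDB tool: it flattens each parsed JSON document into path/value pairs and reports the paths whose values differ. The helper names are hypothetical, and the two sample documents reuse the view fragments from the customer case later in the talk.

```python
import json


def flatten(node, path=""):
    """Yield (path, scalar) pairs for every leaf of a parsed JSON document."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, f"{path}/{key}")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from flatten(value, f"{path}[{i}]")
    else:
        yield path, node


def diff_traces(trace_a, trace_b):
    """Map each differing path to its (value in a, value in b) pair."""
    a, b = dict(flatten(trace_a)), dict(flatten(trace_b))
    return {p: (a.get(p), b.get(p))
            for p in sorted(a.keys() | b.keys())
            if a.get(p) != b.get(p)}


host1 = json.loads(
    '{"view": {"table": "view_name_8", "select_id": 9, "algorithm": "merged"}}')
host2 = json.loads(
    '{"view": {"table": "view_name_8", "select_id": 9,'
    ' "algorithm": "materialized",'
    ' "cause": "Not enough table bits to merge subquery"}}')
changes = diff_traces(host1, host2)
for path, (before, after) in changes.items():
    print(path, before, "->", after)
```

A path that exists on only one side shows up with None on the other, which is enough for a sketch but cannot tell a missing key from an explicit null.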

A user case: range optimizer
● Complex WHERE clause and multi-component index make it unclear what ranges will be scanned
● A classic example:
create table some_events (
start_date DATE,
end_date DATE,
...
KEY (start_date, end_date)
);
select ...
from some_events
where
start_date >= '2019-02-10' and
end_date <= '2019-04-01'
"rows_estimation": [
{
"table": "some_events",
...
"analyzing_range_alternatives": {
"range_scan_alternatives": [
{
"index": "start_date",
"ranges": ["0x4ac60f <= start_date"],
"rowid_ordered": false,
"using_mrr": false,
"index_only": false,
..
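
The trace on this slide lists a single range, bounded only on start_date. A small model of the composite index suggests why: within an open-ended range on the first key part, entries are not ordered by end_date, so the end_date condition can only filter scanned entries, not narrow the range. This is a simplified sketch with invented data, using a sorted Python list in place of the index.

```python
from bisect import bisect_left
from datetime import date

# Entries of KEY (start_date, end_date), kept in index order.
index = sorted([
    (date(2019, 1, 5), date(2019, 1, 20)),
    (date(2019, 2, 10), date(2019, 3, 1)),
    (date(2019, 2, 15), date(2019, 6, 30)),
    (date(2019, 3, 1), date(2019, 3, 15)),
    (date(2019, 5, 1), date(2019, 5, 2)),
])

start_bound = date(2019, 2, 10)
end_bound = date(2019, 4, 1)

# The only usable range: everything from start_bound to the end of the index.
first = bisect_left(index, (start_bound, date.min))
scanned = index[first:]
# end_date <= end_bound is checked per scanned entry, after the fact.
matching = [entry for entry in scanned if entry[1] <= end_bound]
print(len(scanned), len(matching))
```

Here four entries are scanned to find two matches; on a real table the gap between scanned and matching rows is what the range estimate in the trace makes visible.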

Customer Case: a VIEW that stopped merging
● A big join query with lots of nested views
● Performance drop after a minor change to a VIEW
− EXPLAIN shows the view is no longer merged
● Initial idea: the change added a LEFT JOIN, so it must be it
"view": {
"table": "view_name_8",
"select_id": 9,
"algorithm": "merged"
}
"view": {
"table": "view_name_8",
"select_id": 9,
"algorithm": "materialized",
"cause": "Not enough table bits to merge subquery"
}
● (Due to Table Elimination, EXPLAIN showed <64 tables both before and after)

Customer Case 2: no materialization
● Subquery materialization was not used.
+------+--------------------+-------+------+---------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+--------------------+-------+------+---------------+------+---------+------+---------+-------------+
| 1 | PRIMARY | t1 | ALL | NULL | NULL | NULL | NULL | 10 | Using where |
| 2 | DEPENDENT SUBQUERY | t2 | ALL | NULL | NULL | NULL | NULL | 1000000 | Using where |
+------+--------------------+-------+------+---------------+------+---------+------+---------+-------------+
"join_preparation": {
"select_id": 2,
"steps": [
{
"transformation": {
"select_id": 2,
"from": "IN (SELECT)",
"to": "materialization",
"possible": false,
"cause": "types mismatch"
}
● Different datatypes disallow Materialization
● A non-obvious limitation
− Required a server developer with a debugger to figure out
select * from t1 where t1.col in (select t2.col from t2) or ...
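
With the trace available, findings like this one can be pulled out mechanically. The sketch below walks a parsed trace and collects every object that carries a "cause" key; the helper name find_rejections is hypothetical, and the sample document is the fragment from this slide with its brackets closed so that it parses.

```python
import json


def find_rejections(node, path=""):
    """Collect (path, cause) for every trace object that has a "cause" key."""
    found = []
    if isinstance(node, dict):
        if "cause" in node:
            found.append((path or "/", node["cause"]))
        for key, value in node.items():
            found.extend(find_rejections(value, f"{path}/{key}"))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            found.extend(find_rejections(value, f"{path}[{i}]"))
    return found


trace = json.loads("""
{"join_preparation": {"select_id": 2, "steps": [
  {"transformation": {"select_id": 2, "from": "IN (SELECT)",
                      "to": "materialization", "possible": false,
                      "cause": "types mismatch"}}
]}}
""")
rejections = find_rejections(trace)
for path, cause in rejections:
    print(path, "->", cause)
```

On a long trace this gives a short list of places where the optimizer declined a strategy, which is usually the first thing to look at.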

Optimizer trace structure
TRACE: steps: {
join_preparation+,
join_optimization+,
(join_explain | join_execution)+
}
join_optimization : steps {
condition_processing,
substitute_generated_columns,
table_dependencies,
ref_optimizer_key_uses,
rows_estimation,
considered_execution_plans,
attaching_conditions_to_tables,
refine_plan,
}
join_preparation : {
expanded_query
}
rows_estimation: {
analyzing_range_alternatives : { ... }
selectivity_for_indexes,
selectivity_for_columns,
cond_selectivity: 0.nnnn
}

Optimizer trace summary
● Lets you examine how the optimizer processes a query
● Mostly for manual troubleshooting
● Good for bug reporting too
● Currently prints the essentials
− Will print more in the future.

Thanks for your attention!
Come to “How to use histograms to get better performance”
Today, 1:30 pm- 2:20 pm, Gallery C
