Wednesday, July 20, 2011

More on OR-conditions considered bad... And an apology..

In my recent post on OR-conditions I made a mistake, and I appologize for that. I made the statement that MySQL will only use 1 index per statement, whatever you do.

This is no longer true, as a matter of fact, and that has been the case since MySQL 5.0 and I should have checked. MySQL is actually able to use index_merge. An explanation why I didn't look for thi more carefully, yes an explanation, not an excuse, is that the optimizer doesn't seem to want to use this very often. Which is too bad.

So, with this in mind, and using the same table as in the previous post, let's look at index_merge in action. Or possibly, not so much in action. Let's recap what the table looks like:
CREATE TABLE `product` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`brand_id` int(11) NOT NULL,
`quantity` int(11) NOT NULL,
`weight` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `ix_brand` (`brand_id`),
KEY `ix_weight_brand` (`weight`,`brand_id`),
KEY `ix_quantity_brand` (`quantity`,`brand_id`)
) ENGINE=InnoDB AUTO_INCREMENT=2321268 DEFAULT CHARSET=utf8
OK, fair enough, one single table with a bunch of indexes. Looking at the index_merge documentation, we can see that if we have an OR condition with both sides appropriately indexed, then this algoritm would execute each path and then do a sort-merge of the result. Let's try with a simple example, using a similar query to the one used last time, except that we are to ignore the brand_id column this time:
EXPLAIN SELECT id FROM product WHERE weight = 41 OR quantity = 78;
and we get this:
+----+-------------+---------+-------------+-----------------------------------+-----------------------------------+---------+------+-------+------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+-------------+-----------------------------------+-----------------------------------+---------+------+-------+------------------------------------------------------------------+
| 1 | SIMPLE | product | index_merge | ix_weight_brand,ix_quantity_brand | ix_weight_brand,ix_quantity_brand | 4,4 | NULL | 25729 | Using sort_union(ix_weight_brand,ix_quantity_brand); Using where |
+----+-------------+---------+-------------+-----------------------------------+-----------------------------------+---------+------+-------+------------------------------------------------------------------+
That is cool! We are seeing index_merge in action here! Coolness! So, knowing that, let's see if we can get index_merge to work for us in the case which we looked at last time, where we also had a brand_id column in the query. There are indexes on brand_id combined with both quantity and weight, the exact same two indexes used above actually, so adding brand_id should produce the same nice execution plan, but of course reduce the number of rows returned. Lets try it
EXPLAIN SELECT sql_no_cache id FROM product WHERE (brand_id = 6 AND weight = 41) OR (brand_id = 6 AND quantity = 78)
And we get this:
+----+-------------+---------+------+--------------------------------------------+----------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+------+--------------------------------------------+----------+---------+-------+------+-------------+
| 1 | SIMPLE | product | ref | ix_brand,ix_weight_brand,ix_quantity_brand | ix_brand | 4 | const | 4291 | Using where |
+----+-------------+---------+------+--------------------------------------------+----------+---------+-------+------+-------------+
No luck there. For some reason, the optimizer seems to dislike the index_merge access method, except in the most obvious of cases. But hey, we don't give up that easily, do we, we can use a force index, right? Like this:
EXPLAIN SELECT sql_no_cache id FROM product FORCE INDEX (ix_weight_brand,ix_quantity_brand) WHERE (brand_id = 6 AND weight = 41) OR (brand_id = 6 AND quantity = 78);
And the result is this:
+----+-------------+---------+-------------+-----------------------------------+-----------------------------------+---------+------+------+-------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+-------------+-----------------------------------+-----------------------------------+---------+------+------+-------------------------------------------------------------+
| 1 | SIMPLE | product | index_merge | ix_weight_brand,ix_quantity_brand | ix_weight_brand,ix_quantity_brand | 8,8 | NULL | 31 | Using union(ix_weight_brand,ix_quantity_brand); Using where |
+----+-------------+---------+-------------+-----------------------------------+-----------------------------------+---------+------+------+-------------------------------------------------------------+
What is annoying here is that this query, using FORCE INDEX actually hits only 31 rows according to the statistics, whereas the one not using FORCE INDEX potentially hits 4291. Why the optimizer determines the latter to be faster I do not know, but it doesn't seem right to me.

In the example given here, I was using a brand_id of 6. That particular brand_id has less entries than the other ones, so lets giva a shot using brand_id 4, which takes a bit longer. The SELECT using a UNION then looks like this:
SELECT sql_no_cache id FROM product WHERE brand_id = 4 AND weight = 41 UNION SELECT id FROM product WHERE brand_id = 4 AND quantity = 78;
and the one using FORCE INDEX looks like this:
SELECT sql_no_cache id FROM product FORCE INDEX (ix_weight_brand,ix_quantity_brand) WHERE (brand_id = 4 AND weight = 41) OR (brand_id = 4 AND quantity = 78);
Both of these use the same access path: A merge sort using the indexes ix_weight_brand and ix_quantity_brand. Which one do I prefer then? My personal opinion (but it is just that: An opinion that is personal) is to use the UNION, based on four facts:
  • When running these two statements, side by side on the same data and using SQL_NO_CACHE (i.e. not using the query cache), the UNION is consistently faster. Not that the index_merge is much slower or anything, in particular not compared to when using the ix_brand index that is preferred by the optimizer, unless I tell it not to, but the UNION is still faster.
  • Sometimes I could see the optimizer still not doing it's job correctly, even with the FORCE INDEX in place. In some cases only one of the indexes I forced would be used. Don't ask me why, and I cannot reproduce it now, so maybe it was my eyeglassed having fun with me.
  • The UNION construct means that I can use ANSI SQL, the FORCE INDEX not so. This is important to me, as I want to keep my options open when it comes to databases. Which doesn't mean I always use ANSI SQL, but if I have the choice between ANSI and non-ANSI SQL for two statements that are otherwise similar, I choose ANSI SQL.
  • I have a feeling that in my case, the UNION will be more flexible. If more indexes and conditions are added, the FORCE INDEX part will be difficult to maintain, whereas in the UNION this will be easier. Which doesn't mean that I particularily enjoy using a specific SQL declarative construct to optimze performance, I would much rather want the optimizer to deal with this for me. But it doesn't.
In conclusion, yes, I was wrong, I admit it, MySQL sure can use two indexes and do an index merge. But I was right in the sense that this seems to happen rarely, and that the optimizer isn't really doing it's job properly here anyway. But I am glad there is some openings for fixing this, as an access methods exists and the optimizer knows about it.

/Karlsson
Who was wrong! I admit it!

4 comments:

Sergey P. said...

Hello!

It's interesting that while you're complaining that the optimizer doesn't use index_merge when it should, we've had to add a way to turn index_merge off (http://dev.mysql.com/doc/refman/5.1/en/switchable-optimizations.html), because people were complaining that the optimizer used to pick index_merge and that made things slower in comparison with MySQL 4.1.

Sergey P. said...

Lets try it
EXPLAIN SELECT sql_no_cache id FROM product WHERE (brand_id = 6 AND weight = 41) OR (brand_id = 6 AND quantity = 78)
And we get this:


One needs to see the table DDL to tell for sure, but it looks like you're hitting a long-known optimizer deficiency. The good news is that we have a fix for it in MariaDB 5.3 (which is going to be released very soon)

More details here:
http://kb.askmonty.org/en/fair-choice-between-range-and-index_merge-optimizations

Or, in more colorful form, here:
http://en.oreilly.com/mysql2011/public/schedule/detail/19899, slides #5-#8

Daniƫl van Eeden said...

Have you tried USE INDEX instead of FORCE INDEX?

caris said...

https://my.georgeschool.org/ICS/Campus_Life/Campus_Groups/IB_Program/Discussion.jnz?portlet=Forums&screen=PostView&screenType=change&id=fb6ab7f7-e9b4-45e2-acd6-73029342d009