Some of my readers may be aware that nested subqueries such as “SELECT * FROM widgets WHERE id IN (SELECT …)”, don’t work all that well in MYSQL. While the syntax is usually correct, the performance issues in practice can be horrendous. This article delves deeper into this issue, and why MySQL performs so poorly with nested subqueries, but not so deep as to drive us all crazy.
Background
The first complex query I learned how to write was a nested subquery, or just nested query for short. At the time I was learning SQL and databases, it was the simple and most obvious of all the complex queries/joins: Find a set of records whose Id is in a list of outputs of another query.
SELECT id,name,price FROM widgets WHERE id IN (SELECT DISTINCT widgetId FROM widgetOrders)
In the example above, we first query the widgetOrders table for all unique widgets that have been sold based on widgetId (the DISTINCT doesn’t change the output of the query, but can help performance). After we have such a list, we select the id, name, and price of those widgets using data from the widget table.
First off, why do people like nested queries? As I said, they are pretty easy to understand, ESPECIALLY for non-programmers. But what are the alternatives to nested queries? Joins! Non-programmers (and even some programmers) find joins to be mystifying and scary things. Words like INNER JOIN, RIGHT OUTER JOIN, or even the shortcut symbols *= scare a lot of people. For example, the previous query can be rewritten as an INNER JOIN as follows:
SELECT DISTINCT w.id,w.name,w.price FROM widgets w INNER JOIN widgetOrders o ON w.id=o.widgetId
So why is this scarier? First, while aliases like ‘w’ and ‘o’ for table names were previously optional, they become almost required with complex joins, since we’re essentially mixing a two level query into a single level. Also, we have to add new syntax such as INNER JOIN … ON. There’s a lot more going on, and a lot more for beginners to pick up and/or be scared off by.
Why nested joins are bad in theory
The first big question of this article revolves around how query optimizers work. You can write a query a thousand different ways that would all output the same information and might seem equivalent to you, but query optimizers are just not that smart. The search space they have to cover to effectively optimize a query is massive, longer than could ever be searched in a reasonable amount of time. Therefore, query optimizes are often collections of greedy algorithims. Sure, they will do intelligent things when they can figure them out in time, but often they just look for the ‘quick path out’ using some simple heuristics. If a query optimizer thinks a particular plan may be the fastest, it won’t necessarily spend time verifying it; it will just act. Because of this, it is very easy to trip up or hinder a query optimizer.
This brings us to nested queries. Even the best query optimizers in the best database software available have trouble with nested queries. This is because they often cannot optimize them in any reasonable manner. As we saw in the example, I took two separate queries and merged them into one. Most query optimizers are not smart enough to do this since finding such a conversion would take too long, or in computing terms would require too large a search space and near-infinite time. In fact, many query optimizers will flat out refuse to optimize nested queries if it sees them. Therefore, a general rule of thumb is to avoid nested queries as much as possible since you are essentially blocking the query optimizer from touching that part of the query. You should stick with more traditional joins as much as possible since this encourages the query optimizer to find better query paths.
Why nested joins are really bad in MySQL
While nested queries may have been the first type of complex query I worked with, I never had serious problems with them and never spent hours reworking them to non-nested queries, until I started working in MySQL. Many nested queries you might easily write are capable of completely grinding your MySQL database to a halt under certain data conditions. MySQL has posted a list of excuses and tips (to fix your queries instead of their code) and there’s numerous forums posts, blogs, bug reports, and articles discussing the issue, but I’ll streamline it for you: MySQL does terrible things when handling nested subqueries; therefore, if you are using MySQL they should be avoided at all costs.
Note: This does not mean you should avoid the IN or NOT IN syntax, for example “WHERE id IN (1,2,3)” is just fine. The problems is when “1,2,3” is replaced with a subquery such as “SELECT …”.
But Scott, I need a nested query!
If you absolutely need a nested query, you can always perform two distinct queries in your application as such:
Set X = CallDatabase("SELECT DISTINCT widgetId FROM widgetOrders"); CallDatabase("SELECT id,name,price FROM widgets WHERE id IN ("+X+")");
As strange as it sounds to recommend two database calls over one, there are many real cases in MySQL where this will perform better than nested queries.
The Future
The problem with nested queries is that in many circumstances they will perform just fine, but change the data slightly and they can seriously harm database performance in MySQL. For example, strange things can happen if the subquery returns no records so that you end up with “WHERE id IN ()”. Many of the issues with subqueries have been logged as bug on MySQL’s support site, so it’s possible in future versions they will be safer to use. For now though, avoid them as long as you program with MySQL, lest you want to create headaches for yourself down the road.
Just as a side not, rather than using a “in/not in” predicament, use an “exists/not exists”
SELECT id,name,price
FROM widgets as w
WHERE exists (
SELECT 1
FROM widgetOrders as o
WHERE o.widgetId=w.id
)
This syntax is already lighter than an EXISTS.
At least, it is with Postgresql and ms sql server. I have not used Mysql for many years (6 or 7, I think), thus I don’t know where it stands now.
Hi Tripy,
That’s a good suggestion. It feels a little bit like a hack, although I can’t imagine the performance is worse than some of the things that happen with IN/NOT IN.
Thanks!
-Scott
Ok. But what about when you have hierarchical data, such as:
SELECT id_section, id_parent AS aka_parent, barnav, lastmod
FROM sections
WHERE active=1
AND (SELECT active FROM sections WHERE id_section=aka_parent)=1
ORDER BY aka_parent;
There you need to use INSIDE the nested query a value retrieved as an alias in the main query.
So, if you are looping this, and you go the distinct query way you may not get the same results. Isn’t it?
@SomeWhat
That query can be easily rewritten using an inner join. There’s no rule saying that you cannot join a table on itself. It would be a worthwhile exercise for you to figure out how to rewrite the query without using a nested join.
Sincerely, I quite bitched while reading your answer, but your dogmatism paid off. I see now that the way should have been:
SELECT DISTINCT children.id_section, children.id_parent, children.barnav, children.lastmod
FROM sections children LEFT JOIN sections parents ON children.id_parent=parent.id_section
WHERE children.active=1
AND parents.active=1
ORDER BY children.id_parent;
Worked seemlessly.
Thanks -specially- for the wisdom within, man.
Today I used my first nested query, and as such immedietely googled it. I’ve been in databases a long time, and use joins, functions, unions, everything, on a regular basis, but nested queries always seemed messy to me, hard to organise and almost redundant.
Today I needed one for the first time, and why? I did my query directly into the mysql terminal (I rarely do this).
I had to use:
SELECT
count(t1.id_address), min(t1.id_address)
FROM
address t1
WHERE
id_address >= (SELECT
min(id_address)
FROM
applicant t2
WHERE
dt_complete > CURDATE() – INTERVAL 1 YEAR)
It’s so messy. But it was a quick hack. I still see no alternative, since address is used by like 4 other tables.
Posting again, want it to notify me of replies. Silly me.
Pingback: MySQL and nested queries « Ablue Blog
Pingback: Are lots of queries faster in MySql than a big JOIN? | Smash Company
I’m just starting to get into more complicated MYSQL queries as my job requires this but I can’t understand them at all! does anyone know a good resource for learning about joining in mysql queries?
You should focus on learning Database Joins, not MySQL Joins, as the concepts are the same across databases. I’d spend time learning the theory behind joins and then it won’t seem to complicated.
thank you for this! I was already doing a little bit of joining, but preferred writing nested queries for the sake of simplicity … now things were getting more complicated and I’m glad to have read this …
time difference for me: -0.005458 seconds compared to -0.545683 … That’s a 100 times faster!
Before:
SELECT * FROM `content`
WHERE `content`.`predicate_id` = ‘3’
AND `content`.`object_id` IN (‘x’,’y’,’z’)
AND `content`.`subject_id` IN
(SELECT `subject_id` FROM `content`
WHERE `content`.`predicate_id` = ‘some_var’
AND `content`.`object_id` = ‘:some_value’)
time: -0.545683
With join:
SELECT * FROM `content`
LEFT JOIN `content` AS `content2` on `content`.`subject_id` = `content2`.`subject_id`
WHERE `content`.`predicate_id` = ‘3’
AND `content`.`object_id` IN (‘x’,’y’,’z’)
AND `content2`.`predicate_id` = ‘some_var’
AND `content2`.`object_id` = ‘:some_value’
time: -0.005458!
I was using a nested query to re-order results:
“SELECT *
FROM (
SELECT *
FROM messages
ORDER BY message_id DESC
LIMIT 15
)
ORDER BY message_id ASC”
Is it possible to write this as a join? I tried for hours but had to resort to this!
Is this still true in 2015 or did they improve optimization or did something about this?
They closed the bug you mentioned in the article SIX years and seven months later it was opened:
http://bugs.mysql.com/bug.php?id=17952
Better late than ever. Can you please update your article as well?
@ The Lone Man
Nested queries are still considered a poor practice in general, even if MySQL did close this issue. For general performance reasons, you should still avoid them in all DBMS’s whenever possible.
I changed the nested queries:
SELECT id, sender
FROM msgTable
WHERE sender=’John’
OR id IN (SELECT id FROM receiverTable WHERE receiver=’John’)
to the query with JOIN:
SELECT M.id, M.sender
FROM msgTable M LEFT JOIN receiverTable R ON M.id=R.id
WHERE M.sender=’John’
OR R.receiver=’John’
The OR’ed conditions worry me. Will the database smart enough to use the conditions separately for each table to limit the join size, or will it end up creating a large join table before applying the OR’ed conditions in WHERE? If it is the latter, will it better to just use nested queries, rather than JOIN?