Memo: Avoid Nested Queries in MySQL at all costs

Some of my readers may be aware that nested subqueries such as “SELECT * FROM widgets WHERE id IN (SELECT …)” don’t work all that well in MySQL. While the syntax is usually correct, the performance in practice can be horrendous. This article delves into why MySQL performs so poorly with nested subqueries, but not so deep as to drive us all crazy.


Background

The first complex query I learned how to write was a nested subquery, or just nested query for short. At the time I was learning SQL and databases, it was the simplest and most obvious of all the complex queries/joins: find a set of records whose id is in the list of outputs of another query.

SELECT id,name,price 
FROM widgets 
WHERE id IN (SELECT DISTINCT widgetId FROM widgetOrders)

In the example above, we first query the widgetOrders table for the ids of all unique widgets that have been sold (the DISTINCT doesn’t change the output of the query, but can help performance). Once we have that list, we select the id, name, and price of those widgets from the widgets table.

First off, why do people like nested queries? As I said, they are pretty easy to understand, ESPECIALLY for non-programmers. But what are the alternatives to nested queries? Joins! Non-programmers (and even some programmers) find joins to be mystifying and scary things. Words like INNER JOIN, RIGHT OUTER JOIN, or even the shortcut symbol *= scare a lot of people. For example, the previous query can be rewritten as an INNER JOIN as follows:

SELECT DISTINCT w.id,w.name,w.price 
FROM widgets w 
INNER JOIN widgetOrders o ON w.id=o.widgetId

So why is this scarier? First, while aliases like ‘w’ and ‘o’ for table names were previously optional, they become almost required with complex joins, since we’re essentially collapsing a two-level query into a single level. Also, we have to learn new syntax such as INNER JOIN … ON. There’s a lot more going on, and a lot more for beginners to pick up and/or be scared off by.

Why nested queries are bad in theory

The first big question of this article revolves around how query optimizers work. You can write a query a thousand different ways that would all output the same information and might seem equivalent to you, but query optimizers are just not that smart. The search space they would have to cover to perfectly optimize a query is massive, far larger than could ever be searched in a reasonable amount of time. Therefore, query optimizers are often collections of greedy algorithms. Sure, they will do intelligent things when they can figure them out in time, but often they just look for the ‘quick path out’ using some simple heuristics. If a query optimizer thinks a particular plan may be the fastest, it won’t necessarily spend time verifying it; it will just act. Because of this, it is very easy to trip up or hinder a query optimizer.

This brings us to nested queries. Even the best query optimizers in the best database software available have trouble with nested queries, because they often cannot optimize them in any reasonable manner. As we saw in the example, I took two separate queries and merged them into one. Most query optimizers are not smart enough to do this conversion themselves, since finding it would require too large a search space and near-infinite time. In fact, many query optimizers will flat out refuse to optimize nested queries if they see them. Therefore, a general rule of thumb is to avoid nested queries as much as possible, since you are essentially blocking the query optimizer from touching that part of the query. Stick with more traditional joins as much as possible, since they give the query optimizer room to find better query paths.
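You can see some of this for yourself with EXPLAIN. I won’t promise exact output, since it varies by version, but on the MySQL versions that prompted this memo the subquery form typically shows up as a DEPENDENT SUBQUERY, meaning MySQL re-executes the inner query for every row of widgets, while the join form is planned as an ordinary two-table join:

EXPLAIN SELECT id,name,price 
FROM widgets 
WHERE id IN (SELECT DISTINCT widgetId FROM widgetOrders);

EXPLAIN SELECT DISTINCT w.id,w.name,w.price 
FROM widgets w 
INNER JOIN widgetOrders o ON w.id=o.widgetId;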

Why nested queries are really bad in MySQL

While nested queries may have been the first type of complex query I worked with, I never had serious problems with them, and never spent hours reworking them into non-nested queries, until I started working in MySQL. Many nested queries you might easily write are capable of grinding your MySQL database to a halt under certain data conditions. MySQL has posted a list of excuses and tips (to fix your queries instead of their code), and there are numerous forum posts, blogs, bug reports, and articles discussing the issue, but I’ll streamline it for you: MySQL does terrible things when handling nested subqueries; therefore, if you are using MySQL, they should be avoided at all costs.

Note: This does not mean you should avoid the IN or NOT IN syntax; for example, “WHERE id IN (1,2,3)” is just fine. The problem is when “1,2,3” is replaced with a subquery such as “SELECT …”.

But Scott, I need a nested query!

If you absolutely need a nested query, you can always perform two distinct queries in your application, like so:

Set X = CallDatabase("SELECT DISTINCT widgetId FROM widgetOrders");
If X is not empty Then
    CallDatabase("SELECT id,name,price FROM widgets WHERE id IN (" + Join(X, ",") + ")");

As strange as it sounds to recommend two database calls over one, there are many real cases in MySQL where this approach will perform better than a nested query. Just note the guard on the second call: an empty list produces the invalid “WHERE id IN ()”.

The Future

The problem with nested queries is that in many circumstances they will perform just fine, but change the data slightly and they can seriously harm database performance in MySQL. For example, strange things can happen when the subquery returns no records, and if you build the query by hand as in the workaround above, you end up with the invalid “WHERE id IN ()”. Many of the issues with subqueries have been logged as bugs on MySQL’s support site, so it’s possible they will be safer to use in future versions. For now though, avoid them as long as you program with MySQL, lest you create headaches for yourself down the road.

Denormalized Databases: A better example

There were a number of comments about my recent article on the negative effects of too much database normalization, so allow me to expand on the topic a little more. The most consistent comment was that while many of you agreed with me in principle that too much normalization can lead to poor performance, the country name example was a bad choice. As a former teacher, I tend to choose illustrative examples that are easy to understand and explain. As a blog writer, I also tend to forget the web is full of nitpickers who will argue with you unless presented with a concrete example from the real world.

My favorite comment was from a poster in Europe whose country name has changed 3-4 times in the last few years. I could argue most Americans would be shocked to see their country or state name change, but I have to ask: how hard is it to update the name of a country in a database, even if it does change once a year? The trick with database denormalization is that you have to write code that updates two separate sets of items, and in a lot of cases this is less difficult than many of you imagine. If the data is relatively static, as I mentioned, the performance gain on reads can far outweigh the rare cost of performing an update.
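To make “two separate sets of items” concrete, here is a minimal sketch of that country rename, assuming a Country lookup table plus a denormalized country column on the Users table (both names are mine, for illustration):

UPDATE Country SET name = 'Republic of Newland' WHERE name = 'Newland';
UPDATE Users SET country = 'Republic of Newland' WHERE country = 'Newland';

Two statements instead of one, run perhaps once a year; wrap them in a transaction and both copies stay consistent.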

Perhaps it’s better to move on to an example where denormalization of data can play an important part: managing user permissions. And no, this is not to be confused with database security (which is completely useless for user/application-level permission management). Let’s say you have a tree-based system of widgets. Widgets can have sub-widgets and so on, such that each widget has only one parent but can have many children. Now let’s say users within the application (again, not to be confused with database users) are assigned read/write roles on individual widgets as well as on entire subtrees of widgets. For example, Bob may be able to read the entire tree, but he can only write to selected widgets A, B, and C, or to a selected subtree rooted at widget D, which contains X, Y, and Z. Now, how would you design a database system that allows for quick determination of a user’s permissions on a node, considering the user may be accessing dozens or hundreds of nodes at a time, such as in a tree viewer?

Since the height of the tree is unbounded, we would need a database schema that allowed for arbitrary height. A common solution is to have a single table, let’s call it WidgetTable, with a field for the parent reference, let’s call it ParentId. Obviously, the root node would have a null ParentId. Given a node X, how do I quickly determine if X is writable? In this case, the user may have write access to X directly or through a parent node somewhere up the tree.
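As a concrete sketch of that schema (every name besides WidgetTable and ParentId is an assumption on my part):

CREATE TABLE WidgetTable (
    id INT NOT NULL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    ParentId INT NULL,    -- NULL marks the root node
    FOREIGN KEY (ParentId) REFERENCES WidgetTable(id)
);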

The most common answer, since SQL doesn’t support unlimited-recursive queries of this nature, is to query the parent and check if it’s writable, then query the parent’s parent and check if that’s writable, and so on until you find a writable permission for the user or you hit the root. Well, let’s do worst-case asymptotic analysis on this! The worst case would be one long chain of single nodes with X as the leaf and the writable flag on the root. In that case you need to perform O(n) queries, where n is the number of nodes in the database. The worst-case performance of this solution could be awful.
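In query form, the walk up the tree looks something like this, issued once per level until a permission row comes back or ParentId is NULL. The UserWidgetPermission table (userId, widgetId, canWrite) holding the directly granted permissions is a name I’m assuming for illustration:

SELECT w.ParentId, p.canWrite
FROM WidgetTable w
LEFT JOIN UserWidgetPermission p ON p.widgetId = w.id AND p.userId = 42
WHERE w.id = 17;    -- then repeat with ParentId as the new id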

Average-case analysis would find better bounds, but remember the earlier stipulation that you may be viewing dozens or hundreds of widgets at once? In the worst case, that is O(n*m), where n is the number of nodes in the table and m is the number of nodes you are looking up. One of the best solutions in a situation like this is to maintain a denormalized lookup table, let’s call it UserToWidget, that maps users to the nodes they currently have permissions on, taking all parent relationships into account. With such a table in place (and good indexes on it), determining a user’s permissions on a set of m nodes takes O(m) time, a single indexed read per node.
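Here is a sketch of what that lookup table and the fast read might look like (the column names are again assumptions):

CREATE TABLE UserToWidget (
    userId INT NOT NULL,
    widgetId INT NOT NULL,
    canWrite TINYINT NOT NULL,
    PRIMARY KEY (userId, widgetId)    -- doubles as the index for the lookup below
);

SELECT widgetId, canWrite
FROM UserToWidget
WHERE userId = 42 AND widgetId IN (101, 102, 103);

Note that the IN list here is a literal list of the node ids on screen which, as I noted in the nested query memo above, is perfectly fine.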

What’s the cost here? Maintaining the UserToWidget table, of course! In Oracle, there are options such as materialized views which can help. You could also use database triggers to maintain the relationships anytime the higher-level permissions change. But if you want a solution that isn’t database-specific, which most of us do, you can write application code that runs anytime a user’s permissions are updated and pushes the change down to the affected nodes below the one you updated. If database reads are more common than database writes (which I suspect they are if you’re viewing hundreds of nodes at a time), the occasional cost of that complex update is far outweighed by being able to quickly read a widget and determine its permissions for a given user.
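As a sketch of the database-agnostic route (same assumed names as above), when a user is newly granted write access on a node, the application can insert the node itself and then repeat the second statement, which pushes the grant down one level per pass, until it inserts zero rows:

INSERT INTO UserToWidget (userId, widgetId, canWrite)
VALUES (42, 7, 1);    -- the node the grant was made on

INSERT INTO UserToWidget (userId, widgetId, canWrite)
SELECT 42, w.id, 1
FROM WidgetTable w
INNER JOIN UserToWidget granted
    ON granted.userId = 42 AND granted.widgetId = w.ParentId AND granted.canWrite = 1
LEFT JOIN UserToWidget done
    ON done.userId = 42 AND done.widgetId = w.id
WHERE done.userId IS NULL;    -- skip nodes already covered

Each pass is one set-based statement, so even a fairly deep tree settles in a handful of round trips.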

I hope this example helps to illustrate cases where pure normalization can be too costly for practical use. If not, try reading it again 😉

Why too much Database Normalization can be a Bad Thing

As someone with a long history in database optimization, who even did their Master’s project on database normalization, I’m probably the last person in the world you’d expect to argue against database normalization. From a theoretical standpoint, database normalization is a wonderful thing, helping to organize your data into easy-to-manage and understandable parts. For this article, I will play devil’s advocate and argue why too much normalization can be a bad thing.


My years of working in the professional software industry have taught me the practical implications of a fully normalized database system: it’s often slow. It is a non-trivial task to design a database schema that is both fully normalized and performs well under a variety of circumstances. I think back to some systems I have seen in which a single common query joins dozens or more very large tables. Keep in mind, the more normalized your data is, the more joins are required. For example, if you normalize a user’s address into a separate table (so the user can have a set of addresses), you have to join the user table with the address table every time you want to display their address. Then you often have to join with city, state, and country tables for full normalization.
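To picture what that join chain does to even a trivial page, here is what displaying a single user’s mailing address can turn into under full normalization (all table and column names here are illustrative):

SELECT u.name, a.street, ci.name AS city, s.name AS state, co.name AS country
FROM Users u
INNER JOIN Address a ON a.userId = u.id
INNER JOIN City ci ON ci.id = a.cityId
INNER JOIN State s ON s.id = a.stateId
INNER JOIN Country co ON co.id = a.countryId
WHERE u.id = 42;

That is four joins to print one address, and they run on every page view that shows it.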

Let’s take some of the advantages of database normalization and look at why they might not be so great:

Normalization saves space, but space is cheap!
The most obvious advantage of normalization is the space savings. For example, instead of storing “United States of America” in 10,000 records of a Users table, I can create a country table that lists the text once, then reference it with an integer. Clearly, 10,000 integers take less space than 10,000 copies of a 24-character string, but the difference is roughly 240 KB of text versus 40 KB of 4-byte integers. These days, space is cheap! Terabyte drives are now common, so would normalizing some of the fields of a table really save much space? Probably not. I’m not saying denormalize all fields, but for some of them the space advantage is negligible.

Normalization simplifies updates, but reads are more common!
Another common reason for normalization is to simplify updates and reduce anomalies. For example, in the case of the “United States of America” text in the Users table, it’s a lot easier to update the text in a single record than in 10,000 records. But that brings up an interesting point: how often does the text “United States of America” change? I would argue almost never. There are examples where data changes more frequently, such as a user’s address, but it’s common knowledge in database systems that in most tables reads are far more frequent than writes. Therefore, if you have a table with relatively stable data that changes infrequently, normalization isn’t buying you a lot.

Performance, performance, Performance
Database administrators spend the bulk of their time worrying about performance, how to optimize queries and table structure, and in that regard normalization rarely helps. Those calls people tend to get at 2AM about the system crashing over a specific query? They are often related to normalization in some way. If you have a case where you can eliminate a normalized table with minimal impact on functionality, it is better to do so. For example, in the case of addresses you might need users to be able to have multiple addresses (a 1-to-many relationship), so there is no way to avoid normalizing the data into a separate table. But there are other cases, such as states and countries, where constantly joining to two completely static tables is not helping. You can still keep those tables around so a user can select a state or country from a drop-down list, but it may be better to save the text of the state or country in the user’s table instead of a reference.
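In sketch form, the compromise looks like this (illustrative names again): keep the static lookup table for the drop-down list, but store the chosen text directly on the user so that displaying it requires no join at all.

SELECT name FROM Country ORDER BY name;    -- still used to populate the drop-down

SELECT name, city, state, country
FROM Users
WHERE id = 42;    -- the display query, no joins required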

Conclusion and Next Article
In this article I played devil’s advocate, arguing that too much normalization can be a bad thing. On the other end of the spectrum, too little normalization can also be a bad thing. The mark of a good database designer and software developer is the ability to find a balance between the two that matches the database structure against the actual or expected use of the system. In the next article, I will discuss some of the tools available in most databases to help combat the performance issues of normalization, such as indexes, triggers, and materialized views.