Denormalized Databases: A better example

There were a number of comments about my recent article on the negative effects of too much database normalization so allow me to expand the topic a little more. The most consistent comment I saw was that while many of you agreed with me in principle that too much normalization can lead to poor performance, the country name example was a bad choice. As a former teacher, I tend to choose illustrative examples that are easy to understand and explain. As a blog writer, I also tend to forget the web is full of nitpickers that will argue with you unless presented with a concrete example from the real world.

My favorite comment was about how in Europe a poster’s country name has changed 3-4 times in the last few years. I could argue most Americans would be shocked to see their country name or state name change, but I have to think… how hard is it to update the name of a country in a database even if it does change once a year? The trick with database denormalization is you have to write code that updates two separate sets of items, and in a lot of cases this is less difficult than many of you imagine. If updates are uncommon (static) as I mentioned, the performance gain from reading can far outweigh the rare cost of performing an update.

Perhaps its better to move on to a better example of where denormalization of data can play an important part: managing user permissions. And no, this is not to be confused with database security (which is completely useless for user/application-level permission management). Let’s say you have a tree-based system of widgets. Widgets can have sub-widgets and so on such that each widget has only one parent but can have many children. Now let’s say users within the application (again not to be confused with database users) are assigned read/write roles on individual widgets as well as to entire subtrees of widgets. For example, Bob may be able to read the entire tree, but he can only write to select widgets A, B, and C or a select sub-tree of widgets D which contains X, Y and Z. Now, how would you design a database system that allowed for quick determination of a user’s permissions on a node, considering the user may be accessing dozens or hundreds of nodes at a time, such as in a tree viewer?

Since the height of the tree is unbounded, we would need a database schema that allowed for arbitrary height. A common solution is to have a single table, let’s call it WidgetTable, with a field for the parent reference, let’s call it ParentId. Obviously, the root node would have a null ParentId. Given a node X, how do I quickly determine if X is writable? In this case, the user may have write access to X directly or through a parent node somewhere up the tree.

The most common answer, since SQL doesn’t support unlimited-recursive queries of this nature, is to query the parent and check if its writable, then query the parent’s parent and check if that’s writable, and so on until you find a writable permission for the group or you hit the root. Well, let’s do worst case asymptotic analysis on this! The worst case would be if you had one really long chain of single nodes with X being the leaf and the writable flag being on the root. In this case you need to perform O(n) number of queries where n is the number of nodes in the database. The performance of this solution in worst case could be awful.

Amortized (average case) analysis would find better bounds, but remember the earlier stipulation that you may be viewing dozens or hundreds of widgets at once? In worst case, that would be O(n*m) where n is the number of nodes in the table, and m is the number of nodes you are looking up. One of the best solutions in a situation like this is to maintain a denormalized lookup table, let’s call it UserToWidget that maps users to nodes they currently have permissions on, taking all parent relationships into account. With such a table in place (and hopefully good indexes on that table), determining a user’s permissions on a large set of nodes can be done in O(m) time, or near constant depending on the number of nodes you are looking up.

What’s the cost here? Maintaining the UserToWidget table of course! In Oracle, there are options such as materialized views which can help. You could also use database triggers to maintain the relationships anytime the higher-level permissions change. But let’s say you want a solution that isn’t database specific, which most of us do, then you could write application code that acts anytime a user’s permissions are updated, to seek out and update changes to affected nodes below the node you updated. If database reads are more common than database writes (which I suspect they are if you’re viewing hundreds of nodes at a time), the cost of performing the complex update will be far less than the cost of quickly being able to read a widget and determine its permission for a given user.

I hope this example helps to illustrate cases where pure normalization can be too costly for practical use. If not, try reading it again 😉

Why too much Database Normalization can be a Bad Thing

As someone with a long history in database optimization and who even did their Master’s project on Database normalization, I’m probably the last person in the world to argue against database normalization. From a theoretical standpoint, database normalization is a wonderful thing, helping to organize your data into easy-to-manage and understandable parts. For this article, I will play devil’s advocate and argue why too much normalization can be a bad thing.

Database

The years of working in the professional software industry has taught me the practical implications of a fully normalized database system: it’s often slow. It’s a non-trivial issue to design a good database schema that is both fully normalized and performs well on a variety of circumstances. I think back to some systems that I have seen in which a single common query joins dozens or more very large tables. Keep in mind, the more normalized your data is, the more joins that are required. For example, if you normalize a user’s address into a separate table (so the user can have a set of addresses) you have to join the user table with the address table every time you want to display their address. Then, you often have to join with a city, state, and country tables for full normalization.

Let’s take some of the advantages of database normalization and look at why they might not be so great:

Normalization saves space, but space is cheap!
The most obvious advantage of normalization is spacing saving. For example, instead of listing a “United States of America” for 10,000 records in a Users table, I can create a country table that lists the text once, then create a reference with an integer. Clearly, 10,000 integers take less space than 10,000 24-digit text fields. But these days, space is cheap! Terabyte drives are now common, so would normalizing some of the fields of a table really save much space? Probably not. I’m not saying denormalize all fields, but there’s some where the advantages to space are negligible.

Normalization simplifies updates, but reads are more common!
Another common reason for normalization is to simplify updates and reduce anomalies. For example, in the case of “United States of America” text in the Users table, its a lot easier to update/change the text in a single record than it is to update 10,000 records. But that brings up an interesting point, how often does the text “United States of America” change? I would argue almost never. There are examples where data does change more frequently, such as a user’s address, but its common knowledge in database systems that in most tables reads are far more frequent than writes. Therefore, if you have a table with relatively stable data that changes infrequently, normalization isn’t buying you a lot.

Performance, performance, Performance
Database administrators spend the bulk of their time worrying about performance, how to optimize queries and table structure, and in that regard normalization rarely helps. Those calls people tend to get at 2AM about the system crashing over a specific query? They often related to normalization in some way. If you have a case where you can eliminate a normalized table with minimal impact on functionality, it is better to do so. For example, in the case of address you might need users to be able to have multiple addresses (1-to-many relationship), therefore there’s no way to avoid normalizing the data into a separate table. But there are other cases such as with States and Countries, where constantly joining to two completely static tables is not helping. You can still have the tables present in order for a user to select a state or country from a drop-down list, but it may be better to save the text of the state or country in the user’s table, instead of a reference.

Conclusion and Next Article
In this article I played devil’s advocate arguing that too much normalization can be a bad thing. On the other end of the spectrum, too little normalization can also be a bad thing. The mark of a good database designer and software developer is the ability to find a good balance between the two that matches the database structure against the actual or expected use of the system. In the next article, I will discuss some of the tools available in most databases to help combat the performance issues of normalization such as indexes, triggers, and materialized views.

Never return Null Arrays!

Continuing on Jeanne’s theme of nulls, its a pet peeve of mine when I come across code that returns null arrays instead of empty arrays. The purpose of this post is to discuss some of the reasons why its a good practice to return empty arrays over null arrays, including Collection objects or typed array.

Null Pointer Exception

Consider the reusability of the following code:

public List getItems() {
   ...
   // There are no items, return null
   return null;
}

Let’s say you want to iterate on the results, the following code would throw a NullPointerException if used in conjunction with the code above:

List items = getItems();
for(int i=0; i<items.size(); i++) {
   // If there are no items, this code will throw a null pointer
}

Iterating on an array is one of the most commons practices in Java, so returning null instead of an empty array means the person is going to have to do extra null checks when they really don’t need to. In this situation, an empty array would suffice and would not produce any errors.

To summarize, it makes the code shorter, easier to read, and less likely to throw a NullPointerException. Also, there is some confusion with returning null since you may (or may not) be saying an empty array is the same thing as null. For example, the following would be confusing logic:

List items = getItems();
if(items == null) {
   // If null do one thing
} else if(items.size==0) {
   // If empty do another?
}

In this situation its not clear why you might act differently on the two return values and this can lead to confusing, ambiguous code.

The only time I might consider returning null valid is if there was an error of some kind, but then you should be throwing an exception (something more programmers are hesitant to do). Think of it like this, would you rather read a stack trace with a detailed error what went wrong in the code created by you, or would you rather see NullPointerException and wonder what of a dozen objects might have been null?