Denormalized Databases: A better example

There were a number of comments about my recent article on the negative effects of too much database normalization, so allow me to expand on the topic a little more. The most consistent comment was that while many of you agreed with me in principle that too much normalization can lead to poor performance, the country name example was a bad choice. As a former teacher, I tend to choose illustrative examples that are easy to understand and explain. As a blog writer, I also tend to forget that the web is full of nitpickers who will argue with you unless presented with a concrete example from the real world.

My favorite comment came from a European poster whose country name has changed 3-4 times in the last few years. I could argue most Americans would be shocked to see their country or state name change, but I have to ask: how hard is it to update the name of a country in a database, even if it does change once a year? The trick with database denormalization is that you have to write code that updates two separate sets of items, and in a lot of cases this is less difficult than many of you imagine. If updates are uncommon (the data is effectively static), as I mentioned, the performance gain on reads can far outweigh the rare cost of performing an update.

Perhaps it’s better to move on to a better example of where denormalization of data can play an important part: managing user permissions. And no, this is not to be confused with database security (which is completely useless for user/application-level permission management). Let’s say you have a tree-based system of widgets. Widgets can have sub-widgets, and so on, such that each widget has exactly one parent but can have many children. Now let’s say users within the application (again, not to be confused with database users) are assigned read/write roles on individual widgets as well as on entire subtrees of widgets. For example, Bob may be able to read the entire tree, but he can only write to select widgets A, B, and C, or to a select subtree rooted at widget D, which contains X, Y, and Z. Now, how would you design a database schema that allows for quick determination of a user’s permissions on a node, considering the user may be accessing dozens or hundreds of nodes at a time, such as in a tree viewer?

Since the height of the tree is unbounded, we need a database schema that allows for arbitrary height. A common solution is a single table, let’s call it WidgetTable, with a field for the parent reference, let’s call it ParentId. Naturally, the root node has a null ParentId. Given a node X, how do I quickly determine whether X is writable? The user may have write access to X directly or through a parent node somewhere up the tree.
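
As a concrete sketch, the schema might look something like the following. All names are illustrative, and the permission table is my assumption about how the read/write roles from the previous paragraph could be stored:

-- Sketch only: all names are hypothetical
CREATE TABLE WidgetTable (
    WidgetId INT PRIMARY KEY,
    ParentId INT REFERENCES WidgetTable(WidgetId),  -- NULL for the root node
    Name     VARCHAR(100)
);

-- One plausible shape for the role assignments: a row grants the user
-- access to the widget and, implicitly, to its entire subtree
-- (a real system would track read grants the same way)
CREATE TABLE UserWidgetPermission (
    UserId   INT,
    WidgetId INT REFERENCES WidgetTable(WidgetId),
    CanWrite CHAR(1),  -- 'Y' or 'N'
    PRIMARY KEY (UserId, WidgetId)
);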

The most common answer, since portable SQL doesn’t support unbounded recursive queries of this nature (vendor extensions such as Oracle’s CONNECT BY exist, but then you’re tied to the vendor), is to query the parent and check whether it’s writable, then query the parent’s parent, and so on, until you either find a writable permission for the user or hit the root. Well, let’s do a worst-case asymptotic analysis of this! The worst case is one long chain of single nodes, with X as the leaf and the writable flag sitting on the root. In that case you must perform O(n) queries, where n is the number of nodes in the database. The worst-case performance of this solution could be awful.
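
In application code, that answer degenerates into a loop like the following, one round trip per level (reusing the hypothetical tables sketched above):

-- Issued once per level, starting at the node being checked
SELECT w.ParentId, p.CanWrite
FROM WidgetTable w
LEFT JOIN UserWidgetPermission p
    ON p.WidgetId = w.WidgetId AND p.UserId = :userId
WHERE w.WidgetId = :currentId;
-- If CanWrite = 'Y', the node is writable: stop.
-- Otherwise set :currentId to the returned ParentId and repeat,
-- until a grant is found or ParentId is NULL (the root).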

Average-case analysis would find better bounds, but remember the earlier stipulation that you may be viewing dozens or hundreds of widgets at once? In the worst case, that is O(n*m), where n is the number of nodes in the table and m is the number of nodes you are looking up. One of the best solutions in a situation like this is to maintain a denormalized lookup table, let’s call it UserToWidget, that maps users to the nodes they currently have permissions on, taking all parent relationships into account. With such a table in place (and good indexes on it), determining a user’s permissions on a large set of nodes can be done in O(m) time, one cheap indexed lookup per node.
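
A sketch of that lookup table and the read it enables, again with hypothetical names:

-- Denormalized: one row per (user, widget) pair, with permissions
-- inherited from ancestors already flattened in
CREATE TABLE UserToWidget (
    UserId   INT,
    WidgetId INT,
    CanWrite CHAR(1),
    PRIMARY KEY (UserId, WidgetId)  -- also serves as the index for the query below
);

-- Permissions for a whole screenful of widgets in one indexed query
SELECT WidgetId, CanWrite
FROM UserToWidget
WHERE UserId = :userId
  AND WidgetId IN (:id1, :id2, :id3 /* ... */);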

What’s the cost here? Maintaining the UserToWidget table, of course! In Oracle, there are options such as materialized views that can help. You could also use database triggers to maintain the relationships any time the higher-level permissions change. But let’s say you want a solution that isn’t database specific, which most of us do; then you can write application code that runs any time a user’s permissions are updated and seeks out and updates the affected nodes below the one you changed. If database reads are more common than database writes (which I suspect they are if you’re viewing hundreds of nodes at a time), the occasional cost of performing the complex update will be far less than the ongoing savings of being able to quickly read a widget and determine its permissions for a given user.
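
As an illustration of that maintenance step, here is one way the rebuild for a single user might be written. The recursive WITH clause is exactly the kind of vendor-dependent feature you would replace with an application-side tree walk if portability matters, and the names are still the hypothetical ones from above:

-- Rebuild one user's rows after their permissions change.
-- (Recursive WITH support and placement vary by vendor; some spell it
-- WITH RECURSIVE. A portable version would do this walk in application code.)
DELETE FROM UserToWidget WHERE UserId = :userId;

INSERT INTO UserToWidget (UserId, WidgetId, CanWrite)
WITH subtree (WidgetId, CanWrite) AS (
    -- widgets this user is granted directly
    SELECT p.WidgetId, p.CanWrite
    FROM UserWidgetPermission p
    WHERE p.UserId = :userId
    UNION ALL
    -- plus every descendant, inheriting its ancestor's flag
    SELECT w.WidgetId, s.CanWrite
    FROM WidgetTable w
    JOIN subtree s ON w.ParentId = s.WidgetId
)
SELECT :userId, WidgetId, MAX(CanWrite)  -- 'Y' > 'N': an explicit grant anywhere wins
FROM subtree
GROUP BY WidgetId;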

I hope this example helps to illustrate cases where pure normalization can be too costly for practical use. If not, try reading it again 😉

Why too much Database Normalization can be a Bad Thing

As someone with a long history in database optimization, who even did their Master’s project on database normalization, I’m probably the last person in the world you would expect to argue against it. From a theoretical standpoint, database normalization is a wonderful thing, helping to organize your data into easy-to-manage and understandable parts. For this article, however, I will play devil’s advocate and argue why too much normalization can be a bad thing.


Years of working in the professional software industry have taught me the practical implication of a fully normalized database system: it’s often slow. It is a non-trivial task to design a database schema that is both fully normalized and performs well under a variety of circumstances. I think back to some systems I have seen in which a single common query joins a dozen or more very large tables. Keep in mind: the more normalized your data is, the more joins are required. For example, if you normalize a user’s address into a separate table (so the user can have a set of addresses), you have to join the user table with the address table every time you want to display an address. Then, for full normalization, you often have to join with the city, state, and country tables as well.
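
To make that concrete, here is roughly what the fully normalized read looks like; the table and column names are illustrative:

-- Four joins just to display one user's address
SELECT u.name, a.street, ci.city_name, st.state_name, co.country_name
FROM users u
JOIN addresses a  ON a.user_id     = u.user_id
JOIN cities    ci ON ci.city_id    = a.city_id
JOIN states    st ON st.state_id   = a.state_id
JOIN countries co ON co.country_id = a.country_id
WHERE u.user_id = :userId;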

Let’s take some of the advantages of database normalization and look at why they might not be so great:

Normalization saves space, but space is cheap!
The most obvious advantage of normalization is space savings. For example, instead of storing the text “United States of America” in 10,000 records in a Users table, I can create a country table that lists the text once and reference it with an integer. Clearly, 10,000 integers take less space than 10,000 copies of a 24-character text field. But these days, space is cheap! Terabyte drives are now common, so would normalizing some of the fields of a table really save much space? Probably not. I’m not saying denormalize all fields, but for some fields the space advantage is negligible.
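
Concretely, the normalized layout being described is something like this (names illustrative):

-- The text appears once...
CREATE TABLE countries (
    country_id   INT PRIMARY KEY,
    country_name VARCHAR(64)   -- e.g. 'United States of America'
);

-- ...and each of the 10,000 user rows carries only a small integer key:
-- users.country_id INT REFERENCES countries(country_id)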

Normalization simplifies updates, but reads are more common!
Another common reason for normalization is to simplify updates and reduce anomalies. For example, in the case of the “United States of America” text in the Users table, it’s a lot easier to update the text in a single record than to update 10,000 records. But that raises an interesting point: how often does the text “United States of America” actually change? I would argue almost never. There are cases where data changes more frequently, such as a user’s address, but it’s common knowledge in database systems that in most tables reads are far more frequent than writes. Therefore, if you have a table of relatively stable data that changes infrequently, normalization isn’t buying you much.
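
On the rare occasion a rename does happen, the trade-off looks like this (assuming a denormalized country_name column on the Users table, as suggested in the next section):

-- Normalized: the rename touches exactly one row
UPDATE countries SET country_name = 'New Name' WHERE country_id = 42;

-- Denormalized: the same rename touches every matching user row,
-- but how often does this actually run?
UPDATE users SET country_name = 'New Name' WHERE country_name = 'Old Name';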

Performance, Performance, Performance
Database administrators spend the bulk of their time worrying about performance, how to optimize queries and table structure, and in that regard normalization rarely helps. Those calls people tend to get at 2 AM about the system choking on a specific query? They’re often related to normalization in some way. If you have a case where you can eliminate a normalized table with minimal impact on functionality, it is better to do so. For example, users may need to have multiple addresses (a one-to-many relationship), so there’s no way to avoid normalizing that data into a separate table. But there are other cases, such as states and countries, where constantly joining to two completely static tables is not helping. You can keep the tables around so a user can select a state or country from a drop-down list, but it may be better to store the text of the state or country in the user’s table instead of a reference.
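
A sketch of that arrangement, with illustrative names: the static tables survive to feed the UI, while the common read path stays join-free.

-- The lookup tables stick around to populate the drop-down lists...
SELECT country_name FROM countries ORDER BY country_name;

-- ...but the selected text is stored directly on the user row
-- (replacing the country_id reference), so the common read needs no join
ALTER TABLE users ADD country_name VARCHAR(64);

SELECT name, country_name FROM users WHERE user_id = :userId;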

Conclusion and Next Article
In this article I played devil’s advocate, arguing that too much normalization can be a bad thing. On the other end of the spectrum, too little normalization can also be a bad thing. The mark of a good database designer and software developer is the ability to find a good balance between the two, matching the database structure to the actual or expected use of the system. In the next article, I will discuss some of the tools available in most databases to help combat the performance costs of normalization, such as indexes, triggers, and materialized views.

5 Tips to be a Good JDBC Programmer

Have you written Java code for your database connections in JDBC that ended up being thrown away? Would switching to new database software easily kill your product and set you back months while you port? If so, keep reading: the goal of this article is to make you a better JDBC programmer.


    Tip #1: ALWAYS close connections

I wrote an entry about closing database connections back in July, so I won’t belabor the point. Suffice it to say, one of the most common mistakes JDBC programmers make is failing to close database resources such as result sets, statements, and connections. Unfortunately, because of the nature of the problem, a developer often never sees the error until it is discovered in a production environment, since most developers test with one process at a time. In short, every time you open a connection you should close it some time later, and that close should be nearly bulletproof, like so:

Connection con = null;  // initialize to null, or the finally block won't compile
try {
   con = ...
   // ... use the connection ...
} finally {
   // release the connection no matter what happened above
   try { if (con != null) { con.close(); } }
   catch (Exception e1) { /* nothing useful to do if close() itself fails */ }
}

As mentioned in the post about the subject, you can replace the finally code with a call to a method such as closeConnection() which does the same thing.

    Tip #2: Always use PreparedStatements over regular Statements

As you may be aware, PreparedStatements come with many advantages over regular Statements:

  • Sanitizes your database inputs, protecting you from SQL injection
  • Organizes your statements into a set of input/output commands
  • Performance boost: the pre-compiled statement can be reused multiple times

The mistake a lot of programmers make is to say, “Well, I have no inputs to the query, so I’m not going to use a PreparedStatement.” First off, they may need to ‘tack on’ inputs down the road, and it will save a lot of time rewriting the code if the query is already structured as a PreparedStatement. Second, there may be hard-coded input values that you can’t imagine changing but that could change down the road. Better to use a PreparedStatement and a constant than to hard-code the value, as shown in this example:

Statement statement = con.createStatement();
statement.executeQuery("SELECT name FROM widgets WHERE type = 'WidgetB'");

versus:

final String widgetType = "WidgetB";
// declare as PreparedStatement (not Statement) so setString() is available
PreparedStatement pStatement = con.prepareStatement("SELECT name FROM widgets WHERE type = ?");
pStatement.setString(1, widgetType);
pStatement.executeQuery();

Sure, the first example is shorter, but the second gives you the freedom to easily change the widget type down the road, as well as support for additional inputs, since the setup work is already done.

    Tip #3: Be Prepared To Change Databases

It’s not a myth; database platforms do change. You may be working on a project using Oracle when you get a request from your manager for an estimate to port the code to MySQL. It can and does happen. The question is whether you are going to respond with “a couple of days/weeks” or “12-24 months.” Your answer should shed some light on how good your code really is.

    Tip #4: Never Reference Database-Specific Libraries

If you ever find yourself writing a line of code such as this:

OracleCallableStatement cst = (OracleCallableStatement)conn.prepareCall ...

Stop! This is an example where a regular CallableStatement is likely called for, but you’re using a database-specific class in your code. The whole purpose of the JDBC driver pattern is that you can drop new JDBC driver libraries for MySQL, Oracle, etc. into your application and the software will still compile and run just fine. Lines of code like the one above require the Oracle JDBC library at compile time and guarantee you’ll have to rewrite code to port to another database. In other words, writing a line of code like this is a developer admitting, “I will definitely need to rewrite this line of code if I want to port to another database.”

    Tip #5: Never use Database Stored Procedures*

Never, ever write a Java application that relies, in large part, on stored procedures. Not only can they never be ported to a different database; nine times out of ten, you can’t even recall exactly what they did when you wrote them. They fall into the same category as Perl code, which is WORN (write once, read never): most of the time reading them is harder than rewriting them, mostly because T-SQL and PL/SQL are not very pretty languages to work with.

* There are some cases where stored procedures are acceptable, such as reporting statistics or materialized views, but these are few and far between. The bulk of your main application logic, such as adding new users, performing basic transactions, and editing data, should not rely on stored procedures in any way. Reporting gets a pass because computing metrics over millions of records through JDBC would often require too many round trips over the network; stored procedures run on the database itself, giving them the direct data access reporting needs. But if you are stuck using them, insulate them so they don’t interact with your Java code directly, for example as an independent nightly build of statistics.

    Bonus Tip #6: Never put JDBC code inside JSPs

I started with 5 tips, but our readers’ comments reminded me of a 6th tip: never put JDBC code of any kind inside a JSP! The purpose of a 3-tiered architecture is to get good separation of layers, and putting JDBC code inside the presentation tier skips the middle layer altogether. Some developers reduce the impact of this mistake by putting their connection code inside a Java file and only passing the result set to a JSP. This is still really bad! JSPs should have no JDBC code whatsoever, since queries can usually be reused by multiple modules and JSPs don’t exactly lend themselves to reusability.