Which Database to Start With?

When people ask me how to learn to use a database or how to write SQL queries, I tell them to pick a database system and immerse themselves in it. In fact that advice goes for a lot of software technologies: just immerse yourself in a language, as programming tutorials are easy to come by these days. On the other hand, when people ask me which database software to use, I tend to give pause. Most of the time, I recommend MySQL for beginners since it tends to be the most light-weight system to install and use, but I know it’s not often the easiest to understand. With the advent of new light-weight database editions of often heavier products, perhaps it’s time I reconsider the issue.

1. MySQL: Free, lightweight, and readily available

MySQL stands out as the easiest for users to start with, in part because most people can get access to a MySQL database without having to setup anything. Most, if not all, hosting companies that offer database support do so in the form of a MySQL database. The only disadvantage with hosting solutions is that users lose the ability to run local applications on the database, often relying on phpMyAdmin for all database changes. I recommend anyone serious about learning MySQL download and install it themselves, as there are plenty of installation platforms supported.

The good: Free. Easy to download and/or find an existing database to work with. Somewhat easy to install. Lots of free tools available. Good documentation.
The bad: If the installation or auto-configuration breaks, user is left spending hours diagnosing the problems. The MySQL GUI tools, while nice, have to be downloaded separately from the server. Limited support. Clustering and support of large transaction systems is not uncommon. Also, it can be buggy and unpredictable at times, as I’ve seen in practice.

2. Oracle: Heavy and Powerful

Oracle is one of the oldest database systems and stands out as a powerhouse among databases given its vast support for advanced clustering, memory management, and query optimization. If you need something robust, powerful, and able to support millions or billions of transactions a day, it’s the best there is. Oracle needs to be licensed for a production environment, although developers can download a free limited-use version which is good for building an application.

The good: Powerful. Can do some really cool things for those that appreciate it. Extremely scalable.
The bad: Often large and time-consuming installation. Least user friendly of all the database systems, although it’s gotten better over the last few years. Not free. Not a wide variety of tools, free or otherwise, to manipulate the database.

3. Microsoft SQL Server: Easy to use administration interface, often powerful

Microsoft SQL Server has matured greatly over the last 10 years into a decent rival of Oracle. I like MS SQL Server in that it hides a lot of the underlying configuration information from the user. On the other hand, I dislike MS SQL server in that it hides a lot of the underlying configuration information from the user. Double-edged sword, I know. Like Oracle, you need a license if you want to use it in a production environment.

The good: Easy to set up new databases and administer them. Best for those who have no idea how to administer a database. New express editions can be used for free.
The bad: Over-simplifies a lot for advanced users, making it harder to optimize. Not free. Developer edition has nominal cost, although it probably should be free.

Other Databases

This article is not meant to be the end-all for database software discussion, but a beginning guide of the big three database systems for those who are not well-versed in the area. To cover every possible database software, such as PostgreSQL or DB2, as well as countless others, would take a book or two. Most students starting out just need to find a single database and start ‘playing’ with it until they get the hang of it, rather than an exhaustive discussion of which database is best.

Non-standard Databases

Some of you may be more familiar with embedded databases such HSQLDB, SQLite, or Derby than the ones I have mentioned. Rarely do I see beginners using embedded databases, so perhaps I’ll write an article about such systems down the road. Also, I have not purposely not mentioned Microsoft Access as a learning database, simply because I don’t consider it standard database software, but rather a glorified Excel spreadsheet. Most of teaching someone how to use a regular database after using Access, is convincing them all databases are not like Access.

My favorite database? If I’m teaching or writing a relatively simple web-application, MySQL. If someone else is paying for the license and the application is large enough, Oracle.

The Joy of Null: Continued

In Part 1 of The Joy of Null I discussed a variety of ways null-equivalent values make it into the software design. Often times, developer laziness or immutability of the database tier drives many developers to insert values that simulate null values, rather than using a database null itself. In this second half, I’ll talk about other ways null-equivalent values arise as well as problems associated with using them.

Null

Part 1: Revisited
The most consistent comment I saw for why some developers choose to use empty strings instead of null values was for performance reasons. As I’ve mentioned on this blog before, database query optimizers are often a loose association of greedy algorithms and indexes, and its certainly possible a DBMS may perform better when a field is an empty string rather than a null, but I doubt its a general rule. In fact, from a theoretical standpoint, testing whether a field is null should be faster than comparing two strings. Consider that testing whether a field is null is a binary test, it either is or it is not null. Alternatively, testing whether a string is empty, or in the more difficult case is a set of blank spaces equivalent to an empty string, requires a string comparison which, depending on how it’s implemented, could be slow.

I think when people consider performance, they might be referring to the fact that no string = null in a database. In order to query on null, you must use “IS NULL” syntax since (x = null) returns false for all values of x, null or otherwise. Yes, having nulls and empty strings in a query could complicate the logic, but your goal should be to remove the empty strings, not the null values.

Part 2: Data Ambiguity
The first problem with null-equivalent values is you may unintentionally have more than one of them. For example, if you consider an empty string to be a null-equivalent value, do you also consider a 1 or more blank spaces null-equivalent? If so, your software will have to formulate SQL queries with syntax such as “trim(widget) LIKE ””, or alternatively “length(trim(widget)) = 0”. Either way, you now have to perform a database function on every string in a table in order to determine if a value is present, whereas “IS NULL” should be a lot faster.

The second problem with null-equivalent values is you may intentionally have more than one of them. I’m going to refer to enumerated null values as a set of null values that all represent null, but might mean slightly different things. As one reader pointed out, you may have a DBMS that supports enumerated null values such as NULL_MISSING, NULL_REMOVED, NULL_UNKNOWN, etc. I have no objection to using enumerated nulls except that very few, if any, major DBMS systems support it due to the fact it would be difficult to get a group of developers and DBA’s to agree on an enumerated set of nulls that would work across all domains. With that in mind, the vast majority of times enumerated null values are used, they are set as strings in the database. Such as Name = ‘NULL_MISSING’. This has all the performance and string comparison problems of my previous argument, but one even worse – someone’s name/data might actually match a null equivalent value. Your system would be a lot more prone to SQL injection if such a thing were allowed and would require constant conversion between the enumerated value and a useful value in the UI, since you don’t want to expose NULL_MISSING to the user. Keep in mind, this includes alternative null-equivalent values such as using -1 or 0 for a positive numeric field. At some point -1 or 0 may be allowed, or accidentally displayed to the user as a real value. No matter how you set it up, enumerated nulls can often lead to bad data such as the issue with little bobby tables

Finally, there are many times null-equivalent values are used side by side with null values leading to unruly queries such as “WHERE widget IS NULL OR length(trim(widget)) = 0″. As most good DBA’s already know, disjunctive searching (using OR) in SQL queries can significantly hurt a query optimizer. Disjunctions are among the most common paths query optimizers will ignore because searching them is not possible in real-time. In short, if you are using an empty string as a null-equivalent value then you should use it everywhere and make the column not nullable. This will at the least simplify your queries and remove the disjunction. There are other ways to formulate the query above without disjunctions, such as using the negation of length > 0, but it still leads to complicated queries.

Part 3: Choose one
I know there are some developers uncomfortable with using database null or “IS NULL” in their queries, so to them I say, at least be consistent. Either use a null-equivalent value, such as an empty string, everywhere or use null everywhere. The ambiguity is when you allow both, which in some data models may mean different things. Overall, allowing both will likely cause a performance hit. As for which you should choose, null or null-equivalent, if you have strong reason to believe your DBMS will handle empty strings better than null values, then go with empty strings. I, on the other hand, will stick with database null values since they should be faster for the vast majority of queries.

The Joy of Null

Often in the database world, you do not have all the information needed to create a record. For example, you may have a person’s full name but not their middle name or initial, or you might be missing their date of birth. In such cases, the recommended solution is to fill that field with a special database value referred to as Null. Over the years and for a variety of reasons, I have noticed some database designers and software developers have invented their own equivalent of Null. This article is about how this came about and why the practice should be discontinued.

I’m inventing a new term today called null-equivalent value. A null-equivalent value is a value in a system that is not actually null, but meant to imply null. The most common null-equivalent value is a blank string.

Null

Part 1: What happened to null?

Oftentimes, database designers will (perhaps incorrectly) mark a column as “Not Null”, which translates to: this field is absolutely required for all records and we forbid anyone from inserting a record without this value populated. Some time later, software developers will then build an application on top of the database and realize they are missing some of the ‘required’ information needed to create a record. At this point the developer has one of three choices:

  • 1. Acquire the missing information before inserting the record
  • 2. Modify the database design to remove the “Not Null” attribute on the field
  • 3. Insert a null-equivalent value such as a blank string

Clearly, #1 is the best solution since it was the intention of the database designer that this field be populated, but, as I alluded to, that decision may have been incorrect. It may be that the field is not required, and creating records without this field make sense within the data model. In that case, the developer should consult a database designer and go with solution #2 – to remove the “not null” attribute from the record. Often, though, developers are prohibited from making changes to the database and the process of updating the database is long, vast, and sometimes risky. So the vast majority of developers will go with the ‘quick fix’ of #3, and insert a null-equivalent value since it’s the least work in the short term.

While this is the most common reason null-equivalent values are inserted into the database, it’s by far not the only reason. The second driving force results from a poor mix between the presentation layer and data layer. Assume a developer has the freedom to insert null values into a column, ie, the column is nullable. They may choose to instead insert a null-equivalent value such as a blank strings for a variety of reasons. For example, if a developer outputs the contents of the database directly to the screen, they often prefer to use empty strings rather than nulls since it’s a much more lazy approach – they do not need to convert the null values to empty strings, and can display them directly.

I’ve mentioned two of the main driving forces for null-equivalent values, but there others. I use the term null-equivalent instead of blank strings because developers will sometimes insert “special values” meant to imply null as well as have a secondary meaning such as “Lost Data”, “Missing”, or, in one case, “Null” (yes, these are real examples). In many of these cases, the data model should have been updated to include a boolean column, such as “Lost Data = true”, to explain the reason why the original field is blank. But many developers are nothing if not lazy, and the time required to make a database change, as I mentioned, is great. So, they will overload a column with a special single value, used to imply multiple values, all of which should have been fleshed out in a better database design.

The next article will continue this discussion and consider reasons why null-equivalent values are harmful to the database. In other words: TO BE CONTINUED.