Error Checking Across Three-tiered Systems

For today’s post we’re going to delve into one of the least talked about but most common tasks a software developer faces: input validation, or error checking as it’s more commonly known. Input validation means taking what a user enters on screen, verifying that it meets certain requirements, and returning a message to the user if it fails validation.

Let’s say you have a web page that requires a user to enter their zip code as part of account creation. What are the possible paths the user might take? Let’s list them:

  • Success: The user enters a 5-digit zip code
  • Error #1: The user leaves the zip code blank
  • Error #2: The user enters a non-number, such as “Hello”
  • Error #3: The user enters a zip code that has fewer than 5 digits, such as “93”
  • Error #4: The user enters a zip code that has more than 5 digits, such as “9319299”
  • Error #5: The user enters a non-existent zip code such as “00000”

Let’s further add the condition that the application has been developed using the common three-tiered architecture pattern, with a web-based HTML UI, a Java-based application server, and a SQL-based database server.

The first question to ask is: what needs to be validated at which level?

  • Top Tier: User Interface Validation

Stepping away from zip codes for a second, let’s say you want to know the user’s birthday. You could ask them to enter it in a text field, such as “10/11/1970”, or, more commonly, ask them to use drop-down menus to select the month, the day, and the year. The first type of input, where you give the user a lot of control (such as a text field), is referred to as unstructured input. The second type, where you bound the user’s choices to a fixed number of options, is referred to as structured input. It should go without saying that structured input is far easier for a developer to work with than unstructured input, since the number of places things can go wrong is significantly reduced.

Turning back to the zip code example, a structured input version might have a drop-down of every zip code in the country. That would certainly stop users from entering bad data and eliminate most of the errors above, but there are nearly 43,000 zip codes in the US! The drop-down box would be hard to navigate, not to mention the bandwidth cost of sending every user the list.

For zip code input, we are stuck with unstructured input, but there are ways to reduce the chaotic nature of the input. For example, we could set the field’s HTML maxlength attribute to 5, preventing the user from entering more than 5 digits and thereby preventing Error #4 altogether.
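A minimal sketch of such a field, assuming a plain HTML form (the field name is illustrative):

```html
<!-- maxlength keeps the browser from accepting a 6th character;
     size only controls the displayed width of the box -->
<input type="text" name="zipCode" maxlength="5" size="5">
```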

For Errors #1, #2, and #3, we could use JavaScript to validate the input without ever connecting to the application server. If we have extra time we should certainly implement this validation in JavaScript, given the obvious reduction in server load it brings. Unfortunately, web browsers are not well-controlled parts of software systems, and users have the freedom to turn JavaScript off. Therefore, no matter how good the front-tier validation is, it’s really only icing on the cake to improve performance and usability; the ‘meat’ of the validation belongs to the middle tier.

  • Middle Tier: Server side validation

As discussed, unless you have 100% control of a front-end application, which I would argue you never do, invalid data can always reach the middle-tier application server. For example, a user could be connecting via a web service or by typing URLs directly into a browser window. In both of these cases there is no front end to validate the user’s input. Therefore, the primary job of the application server is to provide services that handle all data input from clients and properly store that information in the database.

It follows, then, that the application server needs some way of reporting errors to its users. For example, if the zip code is entered incorrectly and the problem is discovered on the middle tier, the application server should send a clean, friendly message to the user reporting it. When a developer forgets to handle this properly, you end up with web pages full of ugly stack traces, which I’m sure most of you have seen from time to time. In those instances, the developer forgot to wrap an internal error in a user-friendly message. It’s good practice to put a large ‘catch-all’ around each application server entry point, so that in the event the developer missed handling an error, the user sees a generic ‘General System Error’ message. While a generic message like this may not help the user much, it is far better than showing them a huge stack trace, which may reveal private knowledge of the system such as source code paths and method names.
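As a sketch, the middle-tier checks for Errors #1 through #4 might look like the following in Java (the class name and message wording are illustrative, not from any particular framework):

```java
// A minimal sketch of middle-tier format validation for Errors #1 through #4.
public class ZipValidator {

    /** Returns null if the zip code passes, or a user-friendly error message. */
    public static String validateZip(String zip) {
        if (zip == null || zip.isEmpty()) {
            return "Please enter a zip code.";              // Error #1: blank
        }
        if (!zip.matches("\\d+")) {
            return "Zip code must contain only digits.";    // Error #2: non-number
        }
        if (zip.length() != 5) {
            return "Zip code must be exactly 5 digits.";    // Errors #3 and #4
        }
        return null; // format is valid; existence (Error #5) is checked separately
    }
}
```

A servlet or web service entry point would call this before touching the database, and wrap anything that still slips through in the generic catch-all described above.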

You may have noticed I skipped validating Error #5 on the UI tier, and with good reason. Although zip codes in the US may be 5 digits long, not all 5-digit numbers are zip codes (logic 101)! For example, ‘00000’ is not a zip code in any state. In order to validate Error #5, you need a database table listing all possible zip codes to check against. Clearly, this is not something that should be done on the UI side, since it would require downloading a long list of zip codes. A further validation might be to take the city and state a user enters and verify they belong to that zip code. The problem with such aggressive validation is that if your database is less than 100% accurate, users may run into cases where a valid zip code is declared invalid (a false positive, to use testing terminology).
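The existence check itself is a simple lookup. A real middle tier would query a zip code table, say with SELECT COUNT(*) FROM zip_codes WHERE zip = ? (table and column names hypothetical); the sketch below uses an in-memory Set as a stand-in for that table so the example is self-contained:

```java
import java.util.Set;

// Sketch of the Error #5 existence check. The Set stands in for a database
// table of valid zip codes; in production this lookup would be a prepared
// statement against that table instead.
public class ZipDirectory {
    private final Set<String> validZips;

    public ZipDirectory(Set<String> validZips) {
        this.validZips = validZips;
    }

    /** Returns true if a format-valid zip code actually exists. */
    public boolean exists(String zip) {
        return validZips.contains(zip);
    }
}
```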

  • Bottom Tier: Database validation

The final validation happens in the place where the data ultimately ends up: the database. It is most often accomplished in the form of structured fields or uniqueness constraints. Regardless of whether the input is validated on the front or middle tier, the database ultimately owns the data, and its rules cannot be violated.

If the database is so powerful, why not just do all input validation within the database? People have tried, and the short answer is that performance suffers and the result is difficult to maintain. For example, you shouldn’t need to go down all three tiers to check that a zip code is a number; that sort of thing is easily validated on the first two tiers. You do, on the other hand, need to go down all three tiers to check whether a username is unique, since that requires knowledge only the database has. That doesn’t mean you should just insert a user and wait for the database to fail; you should always check for possible errors ahead of time and catch them gracefully before moving on to the next level.

There are times, though, where the database validation is going to throw errors the other two tiers cannot possibly catch. For example, let’s say two users try to insert a record at the same time with the same username ‘MrWidget’, and this field is declared UNIQUE in the database. Both users checked ahead of time to see if ‘MrWidget’ was available, found that the username was free, and committed to creating accounts with it. Unfortunately, only one of these users will get the name ‘MrWidget’; the other will get an error message. These race conditions are not very common in most systems, but they are something your system should be designed to detect and handle when they do happen. A correct solution here is to allow the user that submitted first to proceed, and to display a friendly error message to the second user alerting them that the name is no longer available. This is also a good example of where a generic system exception is not going to help, since a username conflict is something the user can easily correct by picking another name.
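This handling can be sketched as follows. A ConcurrentHashMap stands in for the database’s UNIQUE constraint so the example is self-contained; with JDBC you would instead catch the integrity-constraint exception thrown by the INSERT. The class name and message wording are illustrative:

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of handling the 'MrWidget' race. putIfAbsent() is atomic, so even if
// both users passed the earlier availability pre-check, only one insert wins;
// the loser gets a specific, friendly message instead of a generic error.
public class AccountService {
    private final ConcurrentHashMap<String, String> accounts = new ConcurrentHashMap<>();

    /** Returns null on success, or a friendly error message on a name clash. */
    public String createAccount(String username, String email) {
        String existing = accounts.putIfAbsent(username, email);
        if (existing != null) {
            return "Sorry, the username '" + username + "' is no longer available.";
        }
        return null;
    }
}
```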

  • Final Thoughts

We’ve talked a lot about ‘where’ validation needs to take place, but not necessarily ‘how’ we should implement it. For large enough systems, there is often a common validation package or method for each type of form submission that verifies the data on both the UI and the middle-tier server. Database validation happens automatically, but keep in mind it’s better to detect and head off SQL exceptions ahead of time when you can. Some more advanced approaches, such as Struts, allow you to define basic validation rules in an XML file that can then be used to generate both Java form submission validation and JavaScript validation automatically. Keep in mind, though, that more advanced validation, like checking whether a username already exists, cannot be accomplished even with these techniques and will always require a trip down all three tiers. The purpose of validation is to protect the system, but validation should always be implemented in a way that helps and supports the performance of the system.

J2EE: Why EJB2 CMP was Doomed to Failure


For those in the J2EE world, one of the hottest, most contentious topics of late is what went wrong with EJB version 2. For those not familiar with the subject, I’ll try to provide a brief description. In 2001, Sun released a specification for connecting a Java object, called an Enterprise Java Bean (hereafter referred to as an EJB), to a database table through a construct called an Entity Bean. They referred to the technique as Container-Managed Persistence, or CMP for short. Through the clever use of XML files, one could define Entity Bean relationships between Java code and a database that could change on the fly, without having to recompile the code or rewrite the application.

Initially, Entity Beans were well received as ‘the wave of the future’, but the problem lies in the fact that the specification was mostly worked out in theory. It was written with such detailed requirements for the developer that anyone who tried to implement Entity Beans in EJB2 had immediate code maintenance problems. There were just too many rules, and maintaining large code bases was extremely difficult. In 2006, Sun released the Java Persistence API (JPA for short) in version 3 of the EJB specification, which, for all intents and purposes, was a complete rewrite of EJB2 Entity Beans. They streamlined a lot of the interactions required to set up EJBs and borrowed heavily from more grass-roots technologies like Hibernate and JDO. In essence, they threw out EJB2, copied Hibernate, then renamed the whole thing ‘the next version of EJB’. Ironically, they may have been a little too late, as many organizations had already switched to Hibernate by the time implementations of the new specification were released.

These days most developers prefer Hibernate to EJB, although given the pervasive nature of Sun’s EJB/J2EE terminology I wouldn’t be surprised if JPA/EJB3 (the name, at least) makes a comeback. As I look back on these recent events it makes me wonder: if EJB2 was such a good idea in theory, why did it fail so miserably in practice? What was it about Entity Beans people failed to consider? I hope to use this post to address some of these issues, albeit in hindsight, that the original planners of EJB2 did not consider.

  • Issue #1: Not all Servers are Created Equal

One of the most prominent features of Entity Beans in the EJB2 specification was their ability to work with any database system. A developer could, again in theory, write an application that could seamlessly deploy on a variety of database systems (MySQL, Oracle, SQL Server, etc.) and application servers (JBoss, WebLogic, WebSphere, etc.). The mechanism behind this fluid deployment was that all relationships between the database and the application were stored in a single XML file. One just needed to open this XML file in a text editor, rename a few fields, and the application would work on the new system. Need to deploy on a variety of systems? Just create a new XML file to map the application to each one.

Sounds wonderful in theory, but practically speaking? This never worked; the servers were just too different. For example, while all databases have standardized on a common language called SQL for 90% of database communication, it would be a near miracle to find a code base that did not use any database-specific features. Most large enough systems rely on stored procedures, indexes, foreign key relationships, and key generation that, while similar from database to database, cannot simply be dragged and dropped from one system to another. Porting code from one database system to another often requires editing thousands of files line by line looking for changes, and is quite a difficult, often impossible, task.

And that’s just database servers; application servers require their own server-tailored XML to perform the data mappings for each and every field in the relationship. Most databases have an average of a hundred or so tables with a couple dozen fields per table, so you’d have to maintain a server-specific XML file covering thousands of fields for each application server you wanted to deploy to. When a developer says porting an EJB2 CMP application to a new application server is hard, this is a large part of what he’s referring to.

Part of the problem with the variation among application servers was Sun’s own fault. While they defined the vast majority of the spec themselves, they, for whatever reason, allowed each application server vendor to define its own extensions for features such as database-specific fields. This wiggle room allowed application server developers to create schemas as different as night and day.

  • Issue #2: Maintaining large XML files in J2EE is a contradiction

Going back to a single-server application, let’s say you do implement an EJB2 CMP-based system. At the very least you have an entity system composed of a hundred Java files, a single XML file (ejb-jar.xml) that defines the entity beans, and a single server-specific XML file (jboss-jdbc.xml, open-ejb.xml, etc.) that maps the entity beans to the database. What happens if you make a change to the Java code or the database? Those files must be updated, of course! That means a developer needs exclusive access to edit these two quite large XML files. In large enough groups, developers will be fighting to make changes, and contention for these files will be high in the version control system.
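For a feel of what these descriptors looked like, here is an illustrative fragment in the style of the EJB 2.x ejb-jar.xml descriptor (bean and field names are hypothetical); the server-specific mapping file then repeated every one of these fields again with its database column:

```xml
<!-- Illustrative ejb-jar.xml fragment; names are hypothetical -->
<ejb-jar>
  <enterprise-beans>
    <entity>
      <ejb-name>UserBean</ejb-name>
      <persistence-type>Container</persistence-type>
      <cmp-version>2.x</cmp-version>
      <abstract-schema-name>User</abstract-schema-name>
      <cmp-field><field-name>username</field-name></cmp-field>
      <cmp-field><field-name>zipCode</field-name></cmp-field>
    </entity>
  </enterprise-beans>
</ejb-jar>
```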

The history of J2EE is that it was designed as a component-oriented platform for large businesses. Have a project with only a single developer? There’s no reason to use J2EE, since the work can be done a lot faster without it. On the other hand, have a large project that requires dozens, perhaps hundreds, of developers working together? That’s what you ‘need’ J2EE for. But now add the fact that hundreds of developers are going to be writing and rewriting two very large XML files, and you come to the realization that the technique is fundamentally flawed. You need J2EE for large development teams, but maintaining large single-point files is exactly what breaks down fastest in large development groups.

  • Issue #3: Databases change far less frequently than software

We’ve previously discussed how EJB2 defined all of those XML mappings in a way that allowed you to change the database and application server without ever having to recompile code. The idea (I guess) being that database and application server changes were far more likely than code changes. Well, in practice that idea is completely wrong. The vast majority of J2EE applications run on one specific database type and one specific application server (often a specific version of the software). Furthermore, database changes are extremely rare. Often, once the database structure is defined, it is rarely, if ever, changed. Java code, on the other hand, is often rebuilt on a nightly basis. You would never get into the situation in practice where a change to an existing database system was needed before the code was ready to support it. Add to that the fact that most developers go out of their way to write solutions to software problems that do not require changes to the database! Most are fully aware of the problems associated with changing the database (modifying all related Java and XML files, notifying related teams, writing database patches, etc.), and will only change the database as a last resort.

In short, EJB2 defined a powerful spec for an ever changing persistence database layer, and no one bothers to use it in practice because of the maintenance issues involved. I’m sure I’m not the only developer that has seen poorly named database tables and fields like ‘u_Address_Line3’ (instead of ‘City’) but refrained from changing them knowing the havoc such changes would likely bring. Since the user is never supposed to view the database directly, why should it matter what things are named?

EJB2 Post Mortem and Lessons Learned
Not all of these issues went unaddressed during the life of EJB2. For example, XDoclet is a tool that provided a way of generating the large XML files for each system after a code change was made. The problem? It was never developed to support all systems, and it still had developers fighting to check their version of the generated file into the version control system. Ultimately, though, afterthoughts like XDoclet were too little, too late; people had seen how much more useful object-relational mapping tools such as Hibernate had become, and jumped ship early on. When JPA finally did come out in EJB3, the audience was far from excited: the most dedicated of the crowd, who had adopted EJB2 early on, were left with no way to upgrade their systems other than a near-complete rewrite of their CMP code, and those who would probably have most enjoyed the new features of JPA had already switched to Hibernate, with no reason to switch back.

When JPA/EJB3 was released, it did away with virtually all of the XML files, instead putting the database mapping information directly in the Java code in the form of annotations. In this manner, a developer only needs to look in one place to make changes to the entity bean and database mapping. In the end, EJB2 entity beans are yet another example of why systems designed in theory should be tested in practice before being trumpeted as ‘the wave of the future’. It’s not that JPA is vastly superior to EJB2 in raw power; far from it. As I’ve mentioned, EJB2 had amazing data control and was capable of many powerful features. The problem lies in the inherent contradiction of creating J2EE as a component-oriented platform while enforcing centralized interactions and overly complicated management schemes. In the end, it was hard to find one developer, let alone a team of 30, capable of managing an EJB2 code base properly.

Best Buy: Countdown until We Force You Buy a New TV

If you haven’t been paying attention the US government has been manipulated into outlawing analog TV broadcasts in February of 2009. Most people I talk to don’t understand the issue but will soon enough when perhaps millions of TVs around the country suddenly stop working. The big winners? Cable and TV manufacturers who are essentially using their lobby to force people to either purchase a cable box in every room in their home or buy a lot of new TVs. The big losers? The American public.

But in every battle of corporate greed versus the everyday man, there are those that take things to a whole new level of absurdity. So without further ado, I present Best Buy’s approach: placing an exciting countdown bar on their home page:

Best Buy - DTV CountDown

In related news, there have been a number of recent articles on the subject describing how nations around the world are years ahead of the US because, instead of shutting their analog networks down, they are leveraging them to provide free TV for cell phone users throughout the country.

I’m all for technology upgrades but this one seems unnecessary to me. In particular, there’s no justification for why we need to shut down the old network. We could have both running for many years and give Americans time to adjust. Cable companies fought harder than anyone on this, the dream of using a cable feed for multiple TVs in the house is a thing of the past. Got 5 TVs in your home? That will be 5 cable boxes billed monthly, please.