Errors as Objects
Error handling is the bane of programmers. It is estimated that in some programs well over 60% of the code is error handling, which contributes little to normal operation. Still, this code is required for the stability and robustness of software.
Error handling code has several tasks to perform when an error occurs:
- The state of the system should be restored to (ideally) pre-error conditions.
- The error should be logged with as much information as possible.
- A recovery action may need to be performed.
The first issue, restoring the state of the system, is addressed by my other essays (see: Handling Allocation Errors). The other two issues, reporting and recovery, are the subject of this essay.
As with most other system designs, error handling centers on data structures. In fact, applying traditional programming techniques (such as object-orientation) to error handling can greatly simplify code. Of course, as with all tradeoffs, this solution does have a snag. Namely, insufficient memory errors cannot be handled with this technique and require a bit of special-case logic.
Error Information
Users get frustrated by software that reports an error without sufficient information to fix it. Developers would not tolerate a compiler that reported errors without source line numbers, and the same is true for the users of any system.
Like the answer to any question, an error report should answer the six basic questions:
- Who
- What
- When
- Where
- Why
- How
It seems pretty obvious that this information can be bundled into an object (or record or structure) in the program to describe the error. This “error descriptor” object can be the basic unit of error information.
Coming up with a universal representation for these values can be an exercise in frustration; as always, the devil is in the details. Let us ignore this particular devil for the moment.
No other field can more aptly apply Murphy's law (that which can go wrong, will) than Computer Science. All too often, processing an action, or even handling a previous error, will result in more errors. So “error descriptors” really should be linked together to form a chronological log of what went wrong during an operation.
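As a rough illustration of the idea, here is a minimal sketch of an error descriptor carrying the six facts plus a link to the next error in time. The names ErrorDescriptor, ErrorLog, successor, and report are assumptions made for this example, and plain strings stand in for the field values, since the essay has not yet settled on their representation.

```python
import time
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class ErrorDescriptor:
    """One link in a chronological chain of errors (field types are illustrative)."""
    who: str      # object that was handling the event
    what: str     # event or operation being processed
    where: str    # code location, e.g. "module:file:line"
    why: str      # general cause
    how: str      # specific cause
    when: float = field(default_factory=time.time)   # timestamp of the error
    successor: Optional["ErrorDescriptor"] = None    # next (later) error


class ErrorLog:
    """Tracks head and tail so that appending a descriptor is O(1)."""

    def __init__(self) -> None:
        self.head: Optional[ErrorDescriptor] = None
        self.tail: Optional[ErrorDescriptor] = None

    def report(self, err: ErrorDescriptor) -> None:
        # Link the new descriptor after the current tail.
        if self.tail is None:
            self.head = err
        else:
            self.tail.successor = err
        self.tail = err

    def __iter__(self) -> Iterator[ErrorDescriptor]:
        # Walk the chain from the oldest error to the newest.
        node = self.head
        while node is not None:
            yield node
            node = node.successor


log = ErrorLog()
log.report(ErrorDescriptor("Document", "save", "io:file.py:42",
                           "storage", "disk full"))
# A second error raised while handling the first one joins the same chain.
log.report(ErrorDescriptor("Document", "rollback", "io:file.py:57",
                           "storage", "temp file missing"))
print([e.how for e in log])
```

Walking the log from head to tail gives the errors in the order they happened, which is the chronological record the text asks for.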
Harsh Realities
Our error descriptor scheme is looking pretty good so far. We can certainly say we have the reporting task of error handling done. There is one case that is going to cause us some problems. Up to this point, we have just assumed that error objects can easily be allocated storage from the heap (or wherever dynamic memory can be obtained). One of the most common errors in systems that lack virtual memory is an insufficient memory condition.
This is where the blemish of a special case comes into the picture. We are really facing a chicken-and-egg problem: What to do when we have an error logging an error? The answer is that we log a special error that indicates we could not log any more errors due to memory conditions.
What this amounts to, programmatically, is a single, global error descriptor that indicates an out-of-memory condition. This descriptor is special in that it is pre-allocated and can be linked into any error chain. Even this seemingly simple special case is fraught with pitfalls.
For example, suppose each error descriptor is linked to its successor (the next error in time) via a pointer contained within the descriptor itself, which is the most obvious, compact, and succinct representation. Then we must be extremely careful never to link an error after the global out-of-memory error descriptor. Since there is only one global error object, this approach is not reentrant. One solution is to allocate the global errors on a per-thread basis and guarantee that error operations are never interrupted. A better, more robust solution is to include a flag that prevents further reporting of errors once a global error descriptor is linked to the tail of the list.
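The flag-based approach might look something like the following sketch. Everything here is an assumption for illustration: the names OUT_OF_MEMORY, ErrorChain, and sealed are invented, the descriptor is reduced to a single message field, and allocation failure is simulated with an injected allocator that raises MemoryError rather than by a real low-memory condition.

```python
from typing import Callable, List, Optional


class ErrorDescriptor:
    def __init__(self, message: str) -> None:
        self.message = message
        self.successor: Optional["ErrorDescriptor"] = None


# Allocated once at startup, so reporting it later needs no new memory.
OUT_OF_MEMORY = ErrorDescriptor("out of memory: further errors not logged")


class ErrorChain:
    def __init__(self,
                 allocate: Callable[[str], ErrorDescriptor] = ErrorDescriptor) -> None:
        self.allocate = allocate   # injectable so that failure can be simulated
        self.head: Optional[ErrorDescriptor] = None
        self.tail: Optional[ErrorDescriptor] = None
        self.sealed = False        # True once OUT_OF_MEMORY is the tail

    def _link(self, err: ErrorDescriptor) -> None:
        if self.tail is None:
            self.head = err
        else:
            self.tail.successor = err
        self.tail = err

    def report(self, message: str) -> bool:
        """Return True if the error was recorded, False if it was dropped."""
        if self.sealed:
            return False           # never link anything after OUT_OF_MEMORY
        try:
            err = self.allocate(message)
        except MemoryError:
            self._link(OUT_OF_MEMORY)
            self.sealed = True
            return False
        self._link(err)
        return True

    def messages(self) -> List[str]:
        out, node = [], self.head
        while node is not None:
            out.append(node.message)
            node = node.successor
        return out


def failing_allocator(limit: int) -> Callable[[str], ErrorDescriptor]:
    """Hypothetical allocator that succeeds `limit` times, then raises MemoryError."""
    count = 0

    def allocate(message: str) -> ErrorDescriptor:
        nonlocal count
        if count >= limit:
            raise MemoryError
        count += 1
        return ErrorDescriptor(message)

    return allocate


chain = ErrorChain(failing_allocator(limit=1))
chain.report("disk full")          # recorded normally
chain.report("cannot write log")   # allocation fails, sentinel is linked
chain.report("cleanup failed")     # dropped, because the chain is sealed
print(chain.messages())
```

Because the sealed flag lives on the chain, the successor pointer of the shared descriptor is never written. This sketch does not address the reentrancy concern on its own; a chain shared between threads would still need protection.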
Horizontal Code
Experienced developers are quite familiar with “horizontal code syndrome.” This is the classic approach to error handling: successive if statements that check the return code of each operation, each nested inside the last, so the code creeps toward the right margin.
This approach has two principal disadvantages. First, it makes for large, bulky code, which hurts cache locality, increases memory footprint, and wastes CPU time checking result codes. Second, and more seriously, it results in unmanageable spaghetti code.
There is a somewhat whimsical solution that can remove much of the work involved in checking error codes: structured exception handling. Most often, structured exception handling is implemented as part of a programming language. For languages lacking this feature, it can often be implemented using macros and sleight of hand.
The basic idea is that instead of checking the return code of every operation, the main flow of the program should proceed as if errors don't exist. When an error does happen, the exception handlers are invoked, unwinding the call stack frame by frame and getting the program back to a known state.
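To make the contrast concrete, here is a small sketch of the same three-step operation written both ways. The step names ("open", "read", "parse"), the OK and FAILED codes, and the fail_at parameter used to inject a failure are all hypothetical, and Python's built-in exceptions stand in for whatever mechanism the target language offers.

```python
OK, FAILED = 0, 1


def step_rc(name, fail_at):
    """Hypothetical operation that reports failure through a return code."""
    return FAILED if name == fail_at else OK


def run_with_return_codes(fail_at=None):
    # Every call is followed by a check, and the code drifts to the right.
    done = []
    if step_rc("open", fail_at) == OK:
        done.append("open")
        if step_rc("read", fail_at) == OK:
            done.append("read")
            if step_rc("parse", fail_at) == OK:
                done.append("parse")
                return OK, done
            else:
                return FAILED, done
        else:
            return FAILED, done
    else:
        return FAILED, done


class StepError(Exception):
    pass


def step(name, fail_at):
    """The same hypothetical operation, reporting failure by raising."""
    if name == fail_at:
        raise StepError(name)


def run_with_exceptions(fail_at=None):
    # The main flow reads as if errors did not exist.
    done = []
    try:
        step("open", fail_at)
        done.append("open")
        step("read", fail_at)
        done.append("read")
        step("parse", fail_at)
        done.append("parse")
    except StepError:
        return FAILED, done
    return OK, done


print(run_with_return_codes("read"))
print(run_with_exceptions("read"))
```

Both versions stop at the same point and report the same result; the difference is only in how much of the code is devoted to checking.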
To put the program back into a known state, it is important that major changes to state be undone. This is the job of the exception frames. When a major state change occurs and it is important to know about an exception, an exception frame for the current procedure call frame is pushed onto an exception stack. When an exception is thrown, the error descriptor is passed to each exception frame in turn, giving each a chance to handle the error.
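One way to picture the exception stack is the sketch below. It is only a model of the bookkeeping: the names ExceptionFrame, exception_stack, and throw are assumptions, a plain string stands in for the error descriptor, and throw simply returns the name of the handling frame, whereas a real implementation would also transfer control to that frame (for instance with a non-local jump). A frame would also be popped when its procedure returns normally, which is omitted here.

```python
from typing import Callable, List, Optional


class ExceptionFrame:
    """Records how to undo one major state change, and which errors it handles."""

    def __init__(self, name: str, undo: Callable[[], None],
                 handles: Callable[[str], bool] = lambda err: False) -> None:
        self.name = name
        self.undo = undo
        self.handles = handles


exception_stack: List[ExceptionFrame] = []


def throw(err: str) -> Optional[str]:
    """Offer err to each frame, newest first, undoing state along the way.

    Returns the name of the frame that handled the error, or None if the
    stack was exhausted without a handler.
    """
    while exception_stack:
        frame = exception_stack.pop()
        frame.undo()                 # restore the state this frame guarded
        if frame.handles(err):
            return frame.name        # unwinding stops at the first handler
    return None


# Hypothetical program state guarded by two frames.
state = {"file_open": False, "buffer": None}


def undo_open() -> None:
    state["file_open"] = False


def undo_buffer() -> None:
    state["buffer"] = None


state["file_open"] = True
exception_stack.append(
    ExceptionFrame("load_document", undo_open,
                   handles=lambda err: err == "read failed"))

state["buffer"] = bytearray(16)
exception_stack.append(ExceptionFrame("fill_buffer", undo_buffer))

handled_by = throw("read failed")
print(handled_by, state)
```

The inner frame cannot handle the error, so it only undoes its own change and passes the error outward; the outer frame both undoes its change and claims the error.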
An Exception by any other Name
We now return to the fields of the error descriptor. Some are obvious, others are not. The simplest question to answer is: when. A timestamp is sufficient to note the time of the error. The next easiest question to answer is: where. Any kind of unique marker about the code (such as a module, filename, and line-number combination) is sufficient. Sometimes a label to indicate the major area of functionality will do as well. This field is of importance to the programmer or system analyst, so it can be terse and technical.
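The “when” and “where” fields can usually be filled in automatically. The following sketch shows one possible way to do so using Python's standard inspect module; the helper name here and the "file:line in function" format are assumptions chosen for the example. Note that inspect.currentframe relies on interpreter stack-frame support and may return None on implementations that lack it.

```python
import inspect
import os
import time


def here():
    """Return (timestamp, location) describing the caller of this function."""
    caller = inspect.currentframe().f_back   # frame of whoever called here()
    where = "%s:%d in %s" % (
        os.path.basename(caller.f_code.co_filename),
        caller.f_lineno,
        caller.f_code.co_name,
    )
    return time.time(), where


def save_document():
    # At the point of failure, capture when and where for the descriptor.
    when, where = here()
    return when, where


when, where = save_document()
print(when, where)
```

In a language with a preprocessor, the same information is often captured with file and line macros at the point where the error is raised.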
The remaining questions are a bit less straightforward to answer. The who and what are often objects. The who is typically the object receiving an event and the what is the event. The how and why are more like a cause code, detailing the exact error that occurred. Usually the “why” is a bit more general than the “how.” The most common approach for these values is string identifiers.
In simple systems the string identifiers can be error messages displayed to the user. In more complex systems the identifiers can be looked up in a catalog to best describe the error to the user. Of course, there is a lot of room for variations here depending on the kind of software being written. It may be necessary to include other data (in a generic format) about the error. It is very likely impossible to come up with a completely generic error descriptor format for all software systems.
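A catalog lookup of this kind could be sketched as follows. The identifiers (such as "storage" and "storage.disk_full"), the catalog contents, the locale key, and the describe function are all made up for illustration. The sketch tries the specific “how” identifier first and falls back to the more general “why,” then to the raw identifiers if the catalog has no entry.

```python
# Hypothetical message catalog, keyed by locale and then by identifier.
CATALOG = {
    "en": {
        "storage": "There was a problem with the storage device.",
        "storage.disk_full": "The disk is full. Free some space and try again.",
    },
}


def describe(why, how, locale="en"):
    """Return user-facing text for an error's cause identifiers."""
    messages = CATALOG.get(locale, {})
    # Prefer the specific cause, then the general one, then the raw identifiers.
    return messages.get(how) or messages.get(why) or "%s (%s)" % (why, how)


print(describe("storage", "storage.disk_full"))
print(describe("storage", "storage.bad_sector"))
print(describe("network", "network.timeout"))
```

Falling back to the raw identifiers keeps the report useful to a programmer even when no user-facing text exists yet.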
