ETL Error Handling

In designing a proper ETL architecture, there are two key questions that must be answered. The first is, “What should this process do?” Defining the data start and end points, transformations, filtering, and other steps must be done before any other work can proceed. The second question is, “What should happen when the process fails?” Too often, the design stops without this second question ever being asked or answered. ETL error handling becomes an afterthought rather than a key part of the design.

In this installment of my ETL Best Practices series, I will discuss the importance of error handling logic in ETL processes.

Fail-First Design

Proper ETL error handling isn’t something that can simply be bolted on at the end of a project. Rather, it should be part of the architecture from the initial design. Creating error logic after the ETL processes are built is akin to adding plumbing to a house after all of the walls have already gone up: it can be done, but it’s ugly. Error management done right requires that these pieces be designed and built alongside, not after, the core of the ETL application. For this reason, I practice what I refer to as a fail-first design: I give equal attention to what should happen when the load fails as to what happens when it succeeds. This approach does take longer, but it also makes for more robust and predictable behavior.

ETL Error Handling

Managing error logic in ETL processes is a broad topic, but it can be broken down into the following two approaches.

  • Error prevention: Fix or bypass the error condition to stop the process from failing.
  • Error response: When a failure is inevitable, properly responding to the error is critical.

To be clear, managing failure conditions doesn’t mean you have to choose from the above two approaches. Effective ETL architectures typically include both prevention and response, shaping a two-pronged approach to error management. I’ll cover each of these approaches below.

Error Prevention

There are some conditions that cause failures in load processes that can be prevented using ETL error handling logic. Among the cases where error prevention is useful:

  • The load process has multiple steps, but a single failure should not interrupt the rest of the load
  • There is a known deficiency in some part of the data that can be fixed as part of the ETL process
  • Some part of the ETL process is volatile and you need to retry in the event of a failure

The first item in the list above is the most common scenario in which error prevention is useful. Let’s say you are running ETL operations that source retail data from outlets around the world, across every time zone. For the load in a single time zone, there might be retail outlets whose data is not yet available (the store stayed open late, a network delay, etc.). That scenario doesn’t necessarily represent an error, just a delay in processing. In this example, there is business value in loading the available data and skipping those sources not yet ready. In a case like this, it is common to allow individual tasks to fail without interrupting the ETL process as a whole: the process attempts to execute all of the load tasks, failing at the end only if one or more individual steps failed, as sketched below.
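
As a rough sketch of this pattern (not specific to any ETL tool; the task registry, the retry count, and the logger name below are hypothetical), the orchestration might attempt every task, optionally retry the volatile ones, and raise a single error only at the end:

```python
import logging
from typing import Callable

logger = logging.getLogger("etl")

def run_all_tasks(tasks: dict[str, Callable[[], None]], retries: int = 1) -> None:
    """Attempt every load task; defer the overall failure until all tasks have run."""
    failures = []
    for name, task in tasks.items():
        for attempt in range(1, retries + 2):
            try:
                task()
                break  # this task succeeded; move on to the next one
            except Exception as exc:
                logger.warning("Task %s failed on attempt %d: %s", name, attempt, exc)
        else:
            failures.append(name)  # every attempt for this task was exhausted
    if failures:
        # Fail the process as a whole only after every task has been attempted.
        raise RuntimeError(f"Load completed with failed tasks: {', '.join(failures)}")
```

Each time zone’s load could be registered as one entry in that task dictionary, so a single late source does not block the rest of the load.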

While error prevention is a useful tool in your ETL bag of tricks, let me encourage you not to suppress error conditions simply to force your ETL processes to succeed. Although this is usually well-meaning, it can also cause some unexpected issues if not done properly. Just because an error can be prevented doesn’t mean it should be! Preventing errors is often a business logic decision rather than a technical one. Above all else, be sure that any error prevention or suppression solves the business problem at hand.

Error Response

Not all ETL errors should be prevented. In fact, most business cases lend themselves to error response rather than prevention. As strange as it sounds, there is value in allowing ETL processes to fail:

  • Dependency management: If Step B depends on the success of Step A, then Step B should not be executed if A fails.
  • Error reporting: Often, notifications are triggered by errors. Allowing processes to fail ensures those notifications are performed.
  • Error triggers: Tasks that fail can trigger error response logic for additional notifications, cleanup operations, etc.

It is in the error response design that I liberally use the phrase failing gracefully. In designing processes that are built to fail, you have to consider the ways a process can go wrong and decide what should happen when it does. For many processes, a graceful failure is simply a failure with no other action. Consider the example of a process that loads a volatile staging table: typically this involves truncating the table and then reloading it. If that process happens to fail, there’s nothing to clean up – just fix the deficiency and rerun the truncate/reload sequence. However, it’s a different story when loading transactional data to a production table. If that transactional load fails halfway through, you’ve got a partially loaded target table. You can’t simply rerun the process, lest you end up with duplicated transactions. In that case, there must be some ETL error handling process to clean up after the failed partial load.
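
One way to build that cleanup into the load itself (a sketch only; the sales_fact table, its columns, and the chunk size are hypothetical) is to tag every row with a batch identifier, so a failed partial load can be removed before the process is rerun:

```python
import uuid
import sqlite3  # stand-in for the actual warehouse driver

def load_transactions(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    """Load rows in committed chunks; on failure, delete the partial batch and re-raise."""
    batch_id = str(uuid.uuid4())
    try:
        cur = conn.cursor()
        for i in range(0, len(rows), 1000):  # commit in chunks of 1,000 rows
            cur.executemany(
                "INSERT INTO sales_fact (batch_id, store_id, amount) VALUES (?, ?, ?)",
                [(batch_id, store_id, amount) for store_id, amount in rows[i : i + 1000]],
            )
            conn.commit()
    except Exception:
        # Post-failure cleanup: remove the partially loaded batch so the
        # process can simply be rerun without duplicating transactions.
        conn.execute("DELETE FROM sales_fact WHERE batch_id = ?", (batch_id,))
        conn.commit()
        raise
```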

In some cases, it is possible to predict a failure and respond accordingly. Some ETL tools, my favorite (SSIS) among them, include functionality to check for potential failures prior to execution. In SSIS, you can run a validation against a package without actually executing it. While this validation process won’t detect every possible failure, it does check some common error points. Other validation steps can be performed manually, such as checking that a directory exists or that the execution account has permission to run the load.
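
A few of those manual checks might look like the following sketch (the directory paths are hypothetical, and the checks themselves should reflect whatever your load actually depends on):

```python
import os

def preflight_checks(source_dir: str, archive_dir: str) -> list[str]:
    """Run cheap validations before execution; return a list of problems found."""
    problems = []
    if not os.path.isdir(source_dir):
        problems.append(f"Source directory does not exist: {source_dir}")
    elif not os.access(source_dir, os.R_OK):
        problems.append(f"Execution account cannot read from: {source_dir}")
    if not os.path.isdir(archive_dir):
        problems.append(f"Archive directory does not exist: {archive_dir}")
    elif not os.access(archive_dir, os.W_OK):
        problems.append(f"Execution account cannot write to: {archive_dir}")
    return problems

# Fail before the load even starts if the environment isn't ready.
problems = preflight_checks(r"\\fileserver\retail\incoming", r"\\fileserver\retail\archive")
if problems:
    raise RuntimeError("Pre-execution validation failed: " + "; ".join(problems))
```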

When designing error response logic, be sure to define the granularity. Although we usually talk about error management at the task level, it is possible to manage errors at the row level. Row-level error management is usually classified as a data quality operation rather than one of error management. However, since a single bad row of data can break an otherwise successful load, it becomes part of this domain. When deciding on a row-level error management strategy, the design gets more complicated: What do we do with the bad rows of data? Should a certain percentage of bad data force a failure of the entire load? When implementing such a design, there must be business value in loading the good data while skipping (or loading to triage) the bad data.
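
A row-level strategy along those lines might look like this sketch (the 5% threshold, the transform, and the triage destination are all placeholders for business-driven decisions):

```python
from typing import Callable

def load_rows(
    rows: list,
    transform: Callable,
    load_good: Callable[[list], None],
    load_triage: Callable[[list], None],
    max_bad_pct: float = 5.0,
) -> None:
    """Load good rows, divert bad rows to triage, and fail if too many rows are bad."""
    good, bad = [], []
    for row in rows:
        try:
            good.append(transform(row))
        except Exception as exc:
            bad.append((row, str(exc)))  # keep the reason so triage is meaningful
    load_good(good)
    load_triage(bad)
    bad_pct = 100.0 * len(bad) / max(len(rows), 1)
    if bad_pct > max_bad_pct:
        raise RuntimeError(
            f"{len(bad)} of {len(rows)} rows ({bad_pct:.1f}%) failed; "
            f"exceeds the {max_bad_pct}% threshold"
        )
```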

The details of the error response logic will be different for each process. Again, these are often business decisions rather than technical ones. Ultimately, the design of any ETL load process should include a provision for post-failure operations. Using the fail-first design, assume that every ETL process will fail, and build accordingly.

Fail Early If Possible

All other things being equal, it is better to fail early rather than late in a process. Early failure avoids unnecessary overhead if the entire process would have to be rerun in the event of a failure. Failing early rather than late also reduces the possibility that post-failure data cleanup will be necessary. There are cases where late failures are inevitable (or even preferable). However, for most ETL load scenarios, this rule holds true: fail early if possible.

Logging

It is rare that I write a post about ETL that does not include a reminder about logging. Nowhere is logging more important than in the design of error handling. If your ETL process does nothing else upon failure, at a minimum it should be logging the details of the failure. Capturing failure logging information is critical for triage and architecture improvement. This is doubly true if your ETL process suppresses or prevents errors. The only mechanism that gives you historical insight into error logic is a good logging subsystem. Whether you use the tools built into your ETL software or write your own, be sure that you are capturing enough logging information to fully document any and all failures.
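
As a minimal illustration of that principle (the log file name and process names are hypothetical; a real implementation might just as easily write to a logging table), a wrapper can record the start, duration, and full failure details of every load:

```python
import logging
import traceback
from datetime import datetime, timezone
from typing import Callable

logging.basicConfig(
    filename="etl_load.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("etl")

def run_with_logging(process_name: str, load_fn: Callable[[], None]) -> None:
    """Record start, success, and (most importantly) full failure details for each load."""
    started = datetime.now(timezone.utc)
    logger.info("%s started", process_name)
    try:
        load_fn()
        logger.info("%s succeeded after %s", process_name, datetime.now(timezone.utc) - started)
    except Exception:
        # Capture the full stack trace so post-failure triage has something to work with.
        logger.error(
            "%s failed after %s\n%s",
            process_name,
            datetime.now(timezone.utc) - started,
            traceback.format_exc(),
        )
        raise
```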

Testing Error Logic

Earlier I commented that error logic is often an afterthought. This is even more true of testing error logic. Building tests around error prevention and response logic is as important as testing the core load logic. I have lost count of the number of times I’ve been called in to assist with ETL processes, only to find that not only were the loads failing, but the error logic was failing as well. Testing load processes is difficult, and testing ETL error handling is even more so. However, only proper testing will detect deficiencies in your error logic.
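
Assuming the batch-tagged load sketched earlier lives in a hypothetical etl_pipeline module, one such test could simulate a mid-load failure with a fake connection and assert that the cleanup path actually runs:

```python
import pytest

from etl_pipeline import load_transactions  # hypothetical module holding the load sketched above

class FakeConnection:
    """Minimal stand-in that records SQL statements so the error path can be asserted."""
    def __init__(self):
        self.statements = []
    def cursor(self):
        return self
    def executemany(self, sql, params):
        self.statements.append(sql)
        raise RuntimeError("simulated mid-load failure")
    def execute(self, sql, params=None):
        self.statements.append(sql)
    def commit(self):
        pass

def test_failed_load_cleans_up_partial_batch():
    conn = FakeConnection()
    with pytest.raises(RuntimeError):
        load_transactions(conn, [(1, 10.0), (2, 20.0)])
    # The error logic, not just the happy path, is what this test exercises.
    assert any(stmt.startswith("DELETE FROM sales_fact") for stmt in conn.statements)
```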

Conclusion

Managing errors in ETL should be central to the design of those processes. A good ETL strategy will have measures for preventing and responding to errors at the right granularity. With well-designed ETL error handling, processes are much more robust, predictable, and maintainable.

About the Author

Tim Mitchell
Tim Mitchell is a data architect and consultant who specializes in getting rid of data pain points. Need help with data warehousing, ETL, reporting, or training? If so, contact Tim for a no-obligation 30-minute chat.

3 Comments on "ETL Error Handling"

  1. Regarding the comment that it is better to fail early rather than risk needing to rerun the entire process: ideally the ETL process is designed to be restarted and to skip the tasks that have already completed. This allows it to quickly resume from the original point of failure. As long as the process has been broken down into discrete steps, this should allow for minimal rework. There does need to be an easy way to tell the job to simply start over in the event that is necessary.

    • Hi Ron, thanks for the reminder about restartability. Since that is such a big topic on its own, I’m planning to address that in a separate post in the ETL Best Practices series.

  2. Steven Neumersky | January 4, 2017 at 8:25 pm

    It is also probably a good idea to pair the checkpoint restart functionality (control flow level) with other design patterns that allow for restart without losing data flow pipeline progress (data flow task level, with appropriate source query logic that compares the target table to the source table).
