In my last ETL Antipatterns post, I wrote about the unexciting but very necessary work of documenting ETL processes. The logging of ETL operations is just as (un)captivating as documentation, but is equally as important in the support of data movement and transformation processes.
In this post, I’ll discuss a common misstep in ETL process management: ignoring the logs.
What is ETL Logging?
ETL process logging is the configuration and management of log data generated during data movement and transformations. More concisely, ETL logging captures data load history statistics.
Here are a few of the types of details that are gathered in ETL process logs:
- When the load process started and finished
- The result (success or failure) of the process
- The number of rows processed in each step
- Any error or warning messages
ETL logging serves several valuable purposes:
- It makes it easier to monitor process duration over time
- It allows you to measure data volume growth over time
- It provides a trail of evidence in the event of failure or anomaly
A good ETL logging design requires two things: that each invocation of the ETL process is logged, and that there is a periodic review of those logs to look for anomalies, growth patterns, and other hints that changes may be needed.
ETL Antipattern: ignore the logging
Because logging doesn’t add any functionality to the ETL process, this necessary part of the ETL architecture is often ignored. Without good logging information, there is no easy way to determine how long the processes are running, which parts of the process are more prone to failure, or what the first steps are in the event of a failure.
While it is true that ETL processes can and do run perfectly well without an effective logging strategy, this is certainly an antipattern of good ETL design. Failure to set up and monitor ETL log data means that you will be handling any problems that arise reactively rather than proactively. Unmonitored ETL processes have a tendency to develop problems at the most inconvenient times (hello, 4am wakeup call on New Year’s Day!), so you’ll be better served by getting out in front of any potential problems early.
Create an effective ETL logging strategy
Every organization is different, so there’s no one single set of tactics that will work everywhere. However, there are a few universal guidelines you can use for building your ETL logging strategy.
Make sure each ETL process is configured with the proper logging
What is proper logging? As with everything, it depends. Some processes will require only that you log when each load begins and ends along with its final status. With others, you’ll want to gather row count information to track data volume. Still others may require detailed informational and progress messages for proper auditing.
Some ETL tools will require you to set up logging for each process. Others, including SQL Server Integration Services, log a certain amount of information by default, and allows you to customize that as needed. Pay attention to how your ETL software handles logging, and configure it accordingly.
Monitor the log data
What good is capturing log data if you don’t look at it? Make sure you have a plan for either programmatically or manually reviewing information in your logs on a regular basis. Don’t just look for load failures; review details such as data growth, increases in load runtimes, and any warnings to proactively address changes before they become problems.
Send notifications for key events
For certain logged events, you’ll want to notify support personnel immediately. Load failures certainly come to mind here, but don’t forget to look at other non-failure anomalies such as mismatched row counts or a high percentage of error rows in a single load. Use the information captured in the logs to send notifications for events that need to be addressed quickly.