Imagine for a moment that you’ve built a software thing. In fact, we’ll call it The Thing. You put a lot of work into The Thing, and it does exactly what you wanted it to. You put The Thing into play as part of a larger solution and, after a couple of revisions, its behavior is verified and it is blessed as being production-ready. Months down the road, while building another software solution, you run across a problem similar to the one you solved with The Thing. You make a copy of the original The Thing, tweak it for this purpose, and drop it into the new solution. Repeat this cycle a few times. Five years later, you look up and realize that you’ve got two dozen adaptations of The Thing, all somewhat different and sprinkled across the entire breadth of your code base.
This, dear readers, is how most ETL solutions are developed.
This design, which often results in dozens or even hundreds of copies of the same logic, is created with good intentions: I need to use The Thing here, so I make a copy of the last Thing and modify it for this use case. It solves a problem quickly. However, this technical-debt-inducing behavior costs far more in long-term maintenance than the time it saves over building each new process from scratch.
This quandary is not unique to ETL. Every coding project in every organization has some amount of purposeful code duplication. ETL projects, however, seem to be particularly susceptible to this issue. Developers, architects, and project managers fear the unenviable designation of being the person who broke the data warehouse feed, the customer data load, or the shareholder report. The short path is to duplicate and adapt. But is it the only way?
In my continuing series on ETL Best Practices, I’ll share some thoughts about why ETL modularity helps to reduce the total development time and costs of ownership.
Of all the ETL design patterns I write about, there is none that delivers more return on your time investment than modularizing your ETL processes. In a nutshell, ETL modularization involves abstracting common tasks into reusable units of work. For a more concrete example, let’s consider the scenario of downloading a file from an FTP server, a common task in most any ETL infrastructure. Since most ETL software suites have built-in tasks to handle FTP downloads, a common design pattern is to include that task (with hard-coded paths to the source file and local destination folder) in each ETL process requiring an FTP download. As described, the download and import process would have the following flow.
This pattern works splendidly for a single process. However, you’ll never have just a single ETL process that downloads a file from FTP. These things always travel in herds, so the odds are good you’ve got a ton of related tasks. When you set them up piece by piece over months or even years, your patterns start to look a little like this.
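The copy-and-tweak drift can be sketched in Python. Everything here is a hypothetical stand-in: the hosts, paths, and helper functions represent whatever built-in tasks your ETL tool provides, and the step log exists only to make the drift visible.

```python
# Hypothetical sketch of copy-and-tweak drift. Each helper just records which
# step ran, standing in for a real ETL task (FTP transfer, bulk load, etc.).
steps_run = []

def log_start(name):  steps_run.append(f"{name}:log_start")
def log_end(name):    steps_run.append(f"{name}:log_end")
def ftp_download(host, remote, landing_dir):
    steps_run.append("ftp_download")
    return landing_dir + remote.rsplit("/", 1)[-1]
def import_to_staging(path, table):
    steps_run.append(f"import:{table}")
def archive_file(path, archive_dir):
    steps_run.append("archive")

def process_a():  # the original, complete pattern
    log_start("A")
    f = ftp_download("ftp.example.com", "/outbound/a.csv", "/landing/")
    import_to_staging(f, table="stg_a")
    archive_file(f, "/archive/")
    log_end("A")

def process_b():  # copied months later; the archive step was dropped
    log_start("B")
    f = ftp_download("ftp.example.com", "/outbound/b.csv", "/landing/")
    import_to_staging(f, table="stg_b")
    log_end("B")

def process_c():  # copied again; the logging never made it over
    f = ftp_download("ftp.example.com", "/outbound/c.csv", "/landing/")
    import_to_staging(f, table="stg_c")
    archive_file(f, "/archive/")
```

Nobody decided that Process B shouldn't archive or that Process C shouldn't log; each copy simply drifted a little from the original.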
See what happened? We started with a good pattern in ETL Process A, but the other two varied slightly, each leaving out something from the original pattern. The ETL developer(s) might actually have intended these processes to behave differently. Most of the time, however, that’s not the case – the delta between these design patterns is caused by human error rather than by intent.
Modularizing ETL processes can help avoid issues like these. A modularized design breaks common operations out into separate, distinct processes that can be invoked generically. These modular processes are parameterized to accept operational values at runtime, and are centrally stored so that they are visible and available for common use.
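As a minimal sketch of such a module (the function and parameter names here are my own invention, not any particular ETL product's API): every operational value arrives as a runtime parameter, and the logging and error handling live in exactly one place.

```python
import os

def ftp_download(fetch, host, remote_path, landing_dir, log):
    """Reusable download module: all operational values arrive as parameters.

    `fetch(host, remote_path, local_path)` is the transport callable -- in
    real use, a thin wrapper around ftplib or your ETL tool's FTP task.
    Injecting it keeps the module unit-testable without a live server.
    """
    local_path = os.path.join(landing_dir, os.path.basename(remote_path))
    log(f"download start: {host}:{remote_path} -> {local_path}")
    try:
        fetch(host, remote_path, local_path)
        log("download complete")
        return local_path
    except Exception as exc:  # one error-handling policy shared by every caller
        log(f"download failed: {exc}")
        raise
```

Because every caller goes through the same function, a fix to the logging or retry behavior lands everywhere at once instead of in one of two dozen copies.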
Among the benefits of modularizing ETL processes:
- Standardized behaviors in core functionality as well as in peripheral tasks such as logging and error handling
- Reduced duplicate work in future ETL development
- Easier change processes
- Easier unit testing
- Easier sharing of development tasks across multiple developers
- Smaller surface area for troubleshooting when things go wrong
Establishing an ETL architecture using modular and reusable components takes time and effort to implement, and just like any other piece of software, will mature over time with proper care and feeding. The work of modularization normally involves four phases:
- Inventory of common tasks. What can be modularized? How large or small should those units of work be? I don’t want to trivialize the work that goes into this phase, because deciding which tasks are reusable enough to justify the work (and more importantly, can be made generic without overwhelming complexity) requires a lot of effort. Reviewing logs of past executions, investigating candidate processes that might be consolidated, and building prototypes are all required to do this properly.
- Construction of the modular processes. This phase can be made easier by reusing mature components of existing processes.
- Identification and externalization of runtime values. Any value that might change (such as the FTP server name or download location, in the example above) should be set as a runtime parameter rather than hard-coded.
- Standardization of use. This phase mostly involves documentation and education, and is intended to eliminate the quick-and-easy (and technical-debt-inducing) hard-coded steps modeled above.
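The externalization phase can be sketched as follows, assuming a hypothetical parameter loader; in practice the external store might be an SSIS environment, a config table, or environment variables, but the principle is the same: the values that were hard-coded in the one-off pattern now arrive from outside the process.

```python
import os

# Hypothetical defaults for the FTP download example; in production these
# would live in a config store rather than in code.
DEFAULTS = {
    "ftp_host": "ftp.example.com",
    "remote_path": "/outbound/daily.csv",
    "landing_dir": "/landing",
}

def load_params(overrides=None):
    """Merge defaults, environment overrides (ETL_FTP_HOST, etc.), and
    per-invocation overrides, in increasing order of precedence."""
    params = dict(DEFAULTS)
    for key in params:
        env_val = os.environ.get("ETL_" + key.upper())
        if env_val:
            params[key] = env_val
    params.update(overrides or {})
    return params
```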
In the brief example designs above, there are two easily identifiable opportunities for modularization: the download of files from FTP, and the movement of files to an archive. Because these types of operations are very common, and because each includes logging steps to capture information about the operation (always a good practice – see my ETL Logging design post in this series), both are good candidates for modularization. By encapsulating the common behaviors and parameterizing the values needed at runtime, the following design is reusable and much easier to maintain.
As shown, the high-level loads (ETL Processes A, B, and C) are blissfully unaware of what goes into an FTP download or a file archive operation. All that matters to those three processes is the set of parameters required and the result of each operation. As new processes are added, developers can simply invoke the modules for FTP download and/or file archive as needed, without rewriting those pieces of logic every time. Although this pattern does require more up-front work, the effort needed to update the modular logic or to add new processes using it is far less than that of maintaining many copies of the same ETL tasks.
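A hypothetical sketch of this layout: each process supplies only its runtime parameters, while the shared modules (stubbed to a line or two each here) own the download and archive logic.

```python
# Shared modules: real logging and error handling would live inside these.
def ftp_download(host, remote_path, landing_dir):
    return f"{landing_dir}/{remote_path.rsplit('/', 1)[-1]}"

def archive_file(local_path, archive_dir):
    return f"{archive_dir}/{local_path.rsplit('/', 1)[-1]}"

def import_to_staging(local_path, table):
    # bespoke per-process data flow, deliberately not modularized
    return f"{table} loaded from {local_path}"

# Each process is now just a set of runtime parameters (hypothetical values).
PROCESS_PARAMS = {
    "A": {"remote_path": "/outbound/a.csv", "table": "stg_a"},
    "B": {"remote_path": "/outbound/b.csv", "table": "stg_b"},
    "C": {"remote_path": "/outbound/c.csv", "table": "stg_c"},
}

def run_process(name):
    p = PROCESS_PARAMS[name]
    local = ftp_download("ftp.example.com", p["remote_path"], "/landing")
    import_to_staging(local, p["table"])
    return archive_file(local, "/archive")
```

Adding a fourth feed means adding one entry of parameters, not another hand-built copy of the download and archive steps.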
One final note on ETL modularity: not everything can be made generic enough to fit this pattern. You’ll notice that the example above makes no attempt to genericize the import of the flat file data into the staging table. This was intentional, to demonstrate that some components of ETL will still require piecemeal development. Data flow operations – in particular, those that bind source metadata to the destination – are especially challenging to modularize. I’m not saying it can’t be done, but the effort often outweighs the downstream benefit. As noted earlier, one of the most difficult tasks you’ll face is figuring out which operations can reasonably be modularized, and which cannot.
Building modular ETL processes is challenging and requires discipline, especially when first starting down the path of modularization. However, a properly designed modular ETL architecture saves time and effort, is easier to test and troubleshoot, and reduces the ongoing cost of your ETL operations.