I still remember the first real ETL process I developed. I was working for a hospital at the time, going through a major system implementation as we replaced a 17-year-old UNIX-based system with a more modern healthcare application suite with a SQL Server back end. I was tasked with building, testing, and executing the ETL processes for this conversion. While I was experienced at the essentials of database operations, I was still very new to the world of ETL and our tool of choice, SSIS. My first and only concern was moving the data as a one-time operation, so I put everything in one ginormous SSIS package. This package had dozens of tasks which were too numerous to even fit on a single screen, and the metadata alone for this package was over 5Mb in size.
In the end, the monolithic process I built did what it had to do. It was designed for a single use, and would not be invoked again after the final data load into the new system. Still, I learned some valuable lessons about reusability and testability during the development of that process, and as a result I completely changed the way I built ETL processes from that point forward. It is that design pattern that I’ll share with you in this installment of my ETL Best Practices series.
Atomicity is an overly fancy word to describe the breaking of things into individual parts. The design pattern of ETL atomicity involves identifying the distinct units of work and creating small and individually executable processes for each of those. Unlike the architecture of my first-ever ETL project in which all of the logic was contained in a single process, an atomic design breaks apart the distinct parts of the load into smaller chunks of logic that can each be invoked on their own. While the functional output of the two different design patterns should be the same, an atomic design consisting of more small pieces rather than a few larger ones makes for a more mature process.
The design of ETL atomicity requires an analysis of the parts of a process that are tightly related to each other, rather than focusing at a high level on what needs to be done by the entire project as a whole. I talked a bit about this in my post about ETL modularity, in which I discussed abstracting commonly used functionality to allow easy reuse. Designing ETL processes atomically takes that one step further by separating not just the pieces that might be reused in other loads, but spinning off any part of the load that might need to be modified, tested, or executed on its own.
Why ETL Atomicity?
There are numerous benefits to building smaller, narrowly-focused subprocesses rather than monolithic, do-everything packages.
The testing and debugging experience. If you’ve ever tried to unit test or debug a process with scores of moving parts, you know what a time sink this can be. If something breaks or doesn’t behave as expected, crawling through the code to find the problematic logic is unnecessarily complex and time consuming. By breaking apart the load into functional units, testing and debugging can be done individually to discover errors before full integration testing takes place.
Maintenance. Every ETL process will need to be revisited and tweaked at some point. By separating out the functions into their own processes, updating or replacing those behaviors becomes easier because it is now a modular design.
Sharing of development duties. If you work in a multi-developer environment and you have several staff members working on the same load, the development process will be far easier if each developer works on his/her own assigned subprocess rather than trying to reconcile changes in a single monolithic code file. As an aside for you SSIS developers out there: If you’ve never tried to diff an SSIS package file, don’t; this is an exercise in futility. Separate your logic into multiple packages if you develop as part of a team.
The UI experience. Most ETL tools have a graphical interface, which can become very busy when there are many moving parts involved. Additionally, many of them (including my tool of choice, SQL Server Integration Services) perform design-time validation while you edit packages, and having lots of connections and endpoints can slow down that validation process. Separating out functional units of work to separate subprocesses makes for a friendlier and more responsive UI experience.
Reusability. As noted, keeping ETL processes small and focused doesn’t have to be all about reusability, but this can be a fringe benefit of ETL atomicity. With smaller and more specific functional units, you may be able to reuse some of the components in the same process or even in other processes.
When using an atomic ETL design, you usually do end up with more code. But really, does it matter? As data professionals, our primary goal should not be to reduce the amount of code we write, but to write the most efficient code possible. If an atomic ETL design means that you have 20% more code (which is a very liberal estimate), you’ll get back many times more on your investment of time and storage through easier testing, simpler maintenance, and better reusability.
How Small Should These Processes Be?
When designing for ETL atomicity, should there be only one moving part per process, or two, or ten? As is often the case, the answer is It Depends. I don’t aim for a specific size or number of moving parts; rather, I focus on the functional unit of work. All of the tasks or steps within a single process should be directly related to the others. When evaluating the individual steps of a process to determine which ones (or groups) should stand on their own, I ask the following questions:
- Would it make the testing process easier if I test this task by itself?
- As it relates to this process, would I ever need to execute this task by itself?
- Will this process have logic after this task that would respond differently to a success versus a failure of this task?
- Might the logic in this task be changed or replaced without forcing the modification of other parts of the process?
- Could the logic in this task be reused in other parts of this process or other processes?
The more “Yes” answers I get above, the more likely that task or group of tasks will be cast into a standalone subprocess.
As a more concrete example, imagine a process that will download a zip file from an SFTP site, extract the contained files, truncate a set of staging tables, load those files to staging tables, and then archive the zip file. Given the above brief details, I lean toward the following design:
- A subprocess to perform the SFTP download. I would make this as generic as possible to be usable by other processes, since connecting to SFTP servers for data exchange is quite common in ETL workloads.
- A subprocess to extract the archived files. Again, because this is fairly common for ETL operations I would make this a generic modular process.
- A subprocess to truncate the staging tables and load those tables with the extracted files as a source. Note that there are two different tasks in this subprocess. This is intentional, because it is unlikely that I would need to truncate the staging tables without subsequently reloading them, and I definitely wouldn’t want to load to volatile staging tables without first truncating them.
- A subprocess to move the zip file to an archive location.
As shown, it’s not always a one-task-per-subprocess design. Grouping of tasks is appropriate when the collection works together to perform a single unit of work, and when execution or testing of a single piece from that group does not add value.
Tying Everything Together
Keep in mind that breaking apart processes into groups of closely related tasks does not mean that you still can’t have a single process controlling the execution. In an atomic ETL process, there is usually an orchestrator process that brings together all of the moving parts for an end-to-end load, providing a single point of invocation for that load. The difference between this pattern and the monolith I described at the beginning of the post is that this pattern embeds the working logic (where the “real work” is done) into those subprocesses, and the orchestrator or controller process handles the flow from one subprocess to another.
ETL modularity makes easier the task of developing and maintaining extract, transform, and load processes. By breaking apart functional units of work into individually testable, replaceable, and executable subprocesses, the overall load mechanism is more sustainable.
One other advantage in splitting the ETL process into atomic units is that it helps when something goes wrong and you need to reprocess data. Being able to do this using individul stand alone packages makes repairing a failed ETL load easier.
I am confused here. What’s the difference between ETL atomicity and ETL modularity?
Manali, that’s a fair question. There are attributes of these design patterns that are shared, so there are aspects of modularity that lend themselves to atomic design, and vice versa. Individually, modular design focuses on building executable components that are reusable and dynamic, while atomicity deals with appropriately sizing processes so that each one is an irreducable unit of work.