In my last post in the ETL Antipatterns series, I wrote about the common antipattern of ingesting or loading more data than necessary. This brief post covers one specific case of loading more data than necessary by performing a full data load rather than using a smaller incremental load.
ETL Antipattern: performing full loads instead of incremental loads
Earlier this year, I wrote about the concepts of incremental loads and discussed the benefits of loading data incrementally. To recap: an incremental load moves only the new and changed data from each source – rather than the entire bulk of the source data – through the ETL pipeline.
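To make the concept concrete, here is a minimal sketch of an incremental load using a high-water mark on a modified-date column. The table and column names are hypothetical, and an in-memory SQLite database stands in for real source and destination systems:

```python
import sqlite3

# In-memory stand-in for the source and destination; a real pipeline
# would connect to separate systems. All names here are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (id INTEGER PRIMARY KEY, modified_at TEXT);
    CREATE TABLE dst_orders (id INTEGER PRIMARY KEY, modified_at TEXT);
    INSERT INTO src_orders VALUES (1, '2020-01-01'), (2, '2020-06-01');
""")

def incremental_load(conn, watermark):
    """Move only the rows changed since the last load (the high-water mark)."""
    rows = conn.execute(
        "SELECT id, modified_at FROM src_orders WHERE modified_at > ?",
        (watermark,),
    ).fetchall()
    conn.executemany("INSERT OR REPLACE INTO dst_orders VALUES (?, ?)", rows)
    # Advance the watermark to the newest change we just processed.
    return max((m for _, m in rows), default=watermark)

watermark = incremental_load(conn, "1900-01-01")  # first run: everything is "new"
conn.execute("INSERT INTO src_orders VALUES (3, '2021-01-01')")
watermark = incremental_load(conn, watermark)     # next run: only row 3 moves
```

Each cycle touches only the rows changed since the stored watermark, which is what keeps the load volume small as the source grows.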
Using incremental loads can improve both the speed and accuracy of data movement and transformation. Processing time grows with data volume, so moving only the new and changed data makes each load cycle faster; it also helps ensure an accurate point-in-time representation of the source data. For these reasons, loading data incrementally is, for most data load needs, the better way to go.

What about full data loads?
The alternative to incrementally loading data is to perform a full data load each time – that is, extract the whole of the data from the source and push it all to its final destination in every load cycle. ETL folks use several terms to refer to this, including “whack and load”, “trunc and dump”, and other phrases that hint at the coarse nature of such a design.
Performing a full data load isn’t necessarily an elegant process, but here’s a truism: full data loads are much simpler and easier to build than incremental loads. If you need a quick-and-dirty way to get data from A to B, a full data load is the shortest path. And there are some legitimate cases where this design makes more sense than adding the complexity needed for an incremental load approach. For example, if you are doing a multi-step load in which you load the data to volatile staging tables first (an ELT versus an ETL approach), then using a full “trunc and dump” for that step of the load makes the most sense.
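That staging-table step can be sketched in a few lines. This is a hypothetical "trunc and dump" into a staging table, again using an in-memory SQLite database with made-up table names:

```python
import sqlite3

# In-memory stand-in for a real database; table names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE stg_orders (id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO src_orders VALUES (1, 10.0), (2, 20.0), (3, 30.0);
""")

def full_load(conn):
    """Wipe the staging table, then copy the entire source into it."""
    conn.execute("DELETE FROM stg_orders")  # SQLite has no TRUNCATE statement
    conn.execute("INSERT INTO stg_orders SELECT * FROM src_orders")

full_load(conn)
full_load(conn)  # re-running is harmless: the table always ends in the same state
```

The appeal is obvious: no change detection, no watermark to persist, and the load is trivially rerunnable. The cost is that every cycle moves every row, whether it changed or not.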
There are very few absolutes in technical architecture, and this is no exception: full data loads aren’t always bad. However, in the real world I’ve encountered many ETL processes built to perform full data loads on every cycle when an incremental design would have been more accurate and better performing.
Default to incremental loads
Any time I build an ETL process, I approach it with the assumption that I’ll be doing an incremental load. If there are factors that make such a design unreasonable – whether it’s difficulty in detecting changes, or that the source data is so small and static that an incremental load doesn’t add value – then I’ll revert to a full load. But for most data movement and transformation needs, the incremental load should be the default design.