One of the most significant design considerations in ETL process development is the volume of data to be processed. Most ETL processes have time constraints that require them to complete their load operations within a given window, and the time required to process data will often dictate the design of the load.
One of the more common mistakes I’ve seen is basing the ETL design, and by extension the performance expectation, on the current volume of data. This works well today, but more often than not, data volume will increase over time – sometimes exponentially! A load process that easily handles a million records will perform very differently with 10x or 100x that volume of data.
ETL antipattern: load processes that don’t scale
With very few exceptions, data volume will increase over time. Even when using an incremental load pattern, the most common trend is for the net data (new + changed) to increase with time. Even with steady, linear changes, it’s possible to outgrow the ETL design or system resources. With significant data explosion – commonly occurring in corporate acquisitions, data conversions, or rapid company growth – the ETL needs can quickly outrun the capacity.
Refactoring ETL for significant data growth isn’t always as simple as throwing more resources at the problem. Building ETL for proper scaling requires not just hefty hardware or service tiers; it requires good underlying data movement and transformation patterns that allow for larger volumes of data.
Building for growth
Designing and building ETL processes that scale is cheaper and easier to do up front. Refactoring a load process after the fact is basically a whole new project, requiring new rounds of testing, data validation, and deployment. You can overcome some design limitations by throwing more resources (hardware for on-prem solutions, or service tiers for hosted or cloud solutions) at the problem, but you can end up spending far more trying to overcome an inefficient ETL design than it would have cost to build it properly the first time.
When building ETL processes that scale, keep the following design patterns in mind:
- Don’t just build for today’s needs. Think through the technical and business demands on the data, and ask yourself what’s the most extreme case of data growth that could occur. Design your ETL so that it can handle that volume of data.
- Consider nontraditional ETL/ELT patterns. The traditional extract-transform-load model works well for most scenarios, but depending on the data volume and transformation needs, you may need to move to an extract-load-transform (ELT) pattern to offload some of the processing to the target database. In some cases, an iterative process (ETLT, ELTLT, ELTTT, and various other alphabet soup) may be needed.
- It’s not just about data volume. Consider the T (transformation) part of the ETL as well. If your data volume grows modestly but you expect over time that the transformation of that data will need to increase (such as the addition of new business rules or data cleansing), you should design your processes in such a way that additional transformations won’t cripple the process.
- Keep it set-based as much as possible. This is a good rule of thumb even outside of ETL processes. If you’re dealing with relational data, let the database engine do what it does well, and that means sticking to set-based processing whenever possible.
- Limit the amount of data you’re touching. I wrote about this in an earlier tip in this series. If you can reduce the amount of data you need to ingest in your ETL processes, you’ll almost certainly see better performance.
- Test your processes with a large volume of data. While you should use test data sets that are as close to your production data as possible, it’s also worth occasionally testing those load processes with larger sets of data to get a sense of the scalability. If you are currently processing n rows but are expecting 10n or 20n in the near term, start testing with that volume (or even larger, if possible).
- Don’t just assume you can throw hardware (or higher service tiers) at any future data scale problems. Yes, there will come a time when the ETL process will need more horsepower, but don’t treat that as a silver bullet for all of your ETL performance needs. Starting with a good design will often delay the need to ramp up on resources.
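To make the ELT idea above concrete, here’s a minimal sketch using Python’s sqlite3 module as a stand-in target database. The table names and the cleanup rule are invented for illustration: the raw extract is landed as-is, and the transformation happens afterward, inside the target, as set-based SQL.

```python
import sqlite3

# Hypothetical ELT flow: land the raw extract first, transform in the target.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE raw_customers (name TEXT)")

# E + L: load the extracted rows with no transformation in flight.
extract = [("  Alice ",), ("BOB",), ("  carol",)]
target.executemany("INSERT INTO raw_customers VALUES (?)", extract)

# T: transform inside the target engine with a single set-based statement.
target.execute("""
    CREATE TABLE customers AS
    SELECT UPPER(TRIM(name)) AS name FROM raw_customers
""")
names = [r[0] for r in target.execute("SELECT name FROM customers ORDER BY name")]
```

The point isn’t the trivial cleanup rule; it’s that the heavy lifting moved from the ETL tool’s data flow into the database engine, which is usually better equipped to scale it.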
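The set-based advice is easiest to see side by side. This sketch (table and column names are invented, using an in-memory SQLite database) applies the same update two ways: once row by row, and once as a single statement. Both produce identical results, but the row-by-row version issues one statement per row – the pattern that stops scaling as volume grows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for t in ("orders_rbar", "orders_set"):
    conn.execute(f"CREATE TABLE {t} (order_id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany(f"INSERT INTO {t} VALUES (?, ?)",
                     [(i, float(i)) for i in range(1, 1001)])

# Antipattern: row-by-agonizing-row (RBAR) -- one UPDATE per row.
for (order_id,) in conn.execute("SELECT order_id FROM orders_rbar").fetchall():
    conn.execute("UPDATE orders_rbar SET amount = amount * 1.1 WHERE order_id = ?",
                 (order_id,))

# Set-based: one statement; the engine processes the whole set at once.
conn.execute("UPDATE orders_set SET amount = amount * 1.1")

rbar = conn.execute("SELECT SUM(amount) FROM orders_rbar").fetchone()[0]
setb = conn.execute("SELECT SUM(amount) FROM orders_set").fetchone()[0]
assert abs(rbar - setb) < 1e-6  # same answer, very different scaling behavior
```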
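And for limiting the amount of data you touch, the usual mechanism is an incremental extract keyed off a watermark. This is a hypothetical sketch (the table, columns, and watermark value are invented): only rows modified since the last successful load are pulled, and the watermark is advanced for next time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_rows (id INTEGER, modified_date TEXT)")
conn.executemany("INSERT INTO source_rows VALUES (?, ?)", [
    (1, "2024-01-01"), (2, "2024-02-15"), (3, "2024-03-10"),
])

# Watermark persisted from the previous successful load.
last_watermark = "2024-02-01"

# Extract only new/changed rows instead of re-reading the full table.
changed = conn.execute(
    "SELECT id FROM source_rows WHERE modified_date > ?",
    (last_watermark,),
).fetchall()

# Advance the watermark for the next run.
new_watermark = conn.execute(
    "SELECT MAX(modified_date) FROM source_rows"
).fetchone()[0]
```

As the source table grows, the cost of each load tracks the volume of change rather than the total size of the table.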