In most data projects, building the extract-transform-load (ETL) logic takes a significant amount of time. Enterprise ETL processes must do several things well: retrieve enough data to satisfy the business needs, apply any needed transformations to that data, and load it to the destination(s) without interruption to any other business processes. The work that goes into building and validating that ETL logic can be significant, making the resulting code a very valuable asset to the enterprise.
However, in my travels I’ve discovered that there’s a lot of ETL code that doesn’t get the kind of care it deserves. Failing to treat ETL logic as source code can be a costly and time-consuming mistake.
ETL Antipattern: failing to treat ETL logic as source code
Here’s the key takeaway from this: ETL code is source code. Source code should be versioned, backed up, and attached to formal change processes. Therefore, ETL code should always be versioned, backed up, and constrained by formal change processes.
The all-too-common antipattern here is when ETL code is treated as a kind of throwaway utility that can be changed on the fly and, if necessary, can be easily rebuilt. My company has gotten more than a few of those calls from potential clients where a seemingly small and harmless change to the ETL caused big trouble in the downstream systems. In a few cases, some part of the ETL logic was lost due to a system upgrade or inadvertent deletion of the code.
ETL code is often more valuable to the business than the hardware it resides on, and therefore should be treated as a business asset. ETL code is source code and should be handled as such, including:
- Using a proper source control system for storing and versioning the code
- Maintaining a separate development and/or testing environment (NOT the Production environment!) for testing changes
- Formal procedures for change control of ETL code, including advance notification of such changes
- Regression testing and data validation of any changes prior to moving the source code to Production
Yes, these steps take time and cost money. If your enterprise is used to handling ETL development as an ad-hoc operation, then treating ETL logic as source code is guaranteed to slow down your development process. However, these are necessary steps to protect your company’s investment in its data and the processes that support the data.