“If it compiles, it works.” – An unemployed developer
Building ETL processes is quite easy. Building ETL processes that deliver accurate results as quickly as possible is substantially more difficult. Modern ETL tools (including my personal favorite, SQL Server Integration Services) make it deceptively easy to create simple load process. That’s a good thing, because an easy-to-understand front end shortens the timeline of going from zero to first results.
The challenge with such a low bar to entry is that some folks will stop refining the process when the load process is successful.
ETL Antipattern: failure to test and validate
When building an ETL process, your first successful load isn’t the last step before deployment. That first success is simply a prerequisite for the testing and validation of the results. The fact that a load process completed without errors is similar to an application compiling successfully: even though no errors were found, that doesn’t mean the code is doing what is intended.
There are many ways a “successful” ETL load can go wrong:
- It can introduce duplicate data
- It can incorrectly delete data
- It may be crossing over environments (such as retrieving from dev but loading to production)
- It may not touch any data at all (but hey, it ran really fast!)
Even in cases where the results are correct, there may still be time bombs waiting to detonate:
- The data set you’ve been developing with is substantially smaller than the production data. You’re likely to discover some surprises in terms of performance when moving to production.
- The development data set is too clean. If there are data anomalies in production that your development data does not include, you may discover some leaks in your ETL processes.
- There are hard-coded values in the logic that will need to be manually changed when moving to production
Don’t skip validation and testing
When working with any enterprise ETL tool – or if you’re creating your own ETL logic by hand through scripting – don’t be fooled into thinking that an error-free data load is a success. Data validation and testing shouldn’t be optional exercises. Nor should these be afterthoughts: testing and validation need to be formal parts of the development process.
When formalizing your data validation and testing as part of the ETL development, be sure to include the following:
- Regression testing. Capture the state of the data before the load process to allow easier comparison to the results post-load.
- Look beyond simple row counts. Comparing before-and-after row counts is one measure of a data load, but it does little to fully validate those results
- Automate some of the key validation points. Don’t rely solely on eyes-on-data to confirm that the load is working as designed. Whether you use third-party tools or create your own validation logic, you can speed up some of the process of data validation by using automation to look at metrics that don’t require a human to interpret.
- Use test data that is as close as possible to production data in terms of volume and quality. If possible, use a copy of production data (but not the actual production data, please!) for testing. If this isn’t possible (due to legal or regulatory constraints, or in cases where the production data contains sensitive data that should not be copied to testing environments), it’ll require extra work to anonymize or otherwise build a testing data set, but it’ll be worth the trouble.
- Be sure to test the code promotion process, too. This will help you find hard-coded values that should be parameterized when promoting your code to production.