In the last post in my ongoing series about ETL best practices, I discussed the importance of error handling in ETL processes, reviewing best practices for application flow to prevent or gracefully recover from a systematic error or data anomaly. In this post, I’ll dig a bit further into that topic to explore the design patterns for managing bad data in ETL processes.
What is Bad Data?
Bad is a subjective term, and by extension, so is bad data. Therefore, there is no single irrefutable definition of bad data; it can and will differ from one organization to the next, and from one ETL process to another. However, here is the general guideline that I follow:
Bad data is data whose origin or accuracy is suspect, and as a result, the risk of its use outweighs the potential business value.
Exactly what qualifies as bad data will vary widely. When loading a list of contact information for potential sales leads, having missing or incorrect data values is expected and is most likely not a disqualifying attribute. On the other hand, in a load of financial data, a missing or invalid value for a single row could make the entire load of data suspect. Similarly, the origin of the data could call into question its value (for any load into an audited system) or may not have much of an impact (for example, when calculating sentiment analysis).
Often the first and largest challenge when building an ETL architecture is deciding what is and is not bad data. Ultimately, the business needs – not necessarily technical requirements – should drive this logic.
One last point on defining the boundaries of bad data: it’s not always a binary decision. Indeed, most such evaluations are pass/fail, but occasionally the need arises to get more granular in how data quality is graded. Create these relative data quality grades only if you must; each bucket you create adds to the complexity and maintenance overhead.
Defining the granularity of bad data handling normally means deciding on one of two options: row-by-row or the load as a whole.
Manage suspect data row-by-row
Processing suspect data is typically a row-by-row operation. Of these two design patterns, handling bad data a row at a time is by far the more common and the easier approach to take. Using this approach, the ETL process will filter out each suspect row of data based on the criteria set up in the prior phase, and then cleanse, redirect, or delete (more on these options later) each of these rows.
Manage data as an all-or-nothing load
There are business cases that require a load-level approach to bad data, where the entire load process succeeds or fails as one unit. In this design, even one suspect row of data among millions would invalidate the entire load. The all-or-nothing architecture requires a rollback or deletion mechanism that is automatically triggered in the case of a failure.
A perfect example of this is an ETL process loading general ledger entries. In such loads, excluding even a single row of data due to data quality issues would throw off the balance of the entire load. Here it’s better to have no data loaded at all than to have a partial (and therefore incorrect) set of data written to the destination.
All-or-nothing load architectures are more complex than their row-by-row error handling counterparts. Use this design pattern only if the business or technical needs require it; when in doubt, process errors or other anomalies at the row level.
Managing Bad Data in ETL
Once the definition of bad data and the granularity have been agreed upon, the next phase is to design the tactical approach to managing bad data in ETL.
Although the details of handling bad data will be very technical, always keep in mind that the business needs should dictate the behavior. The first and primary concern when handling bad data should always be the business impact.
Clean Up in ETL
When bad data is found, use the ETL process to clean up and continue processing that data if possible. Resolving data deficiencies in-flight usually has the least amount of residual baggage. Suspect data that can be cleaned inline does not suffer from the delay of having a separate process (or worse, human intervention) to perform clean-up before it can be used. Cleansing data inline also makes data lineage clearer, as the bifurcated cleansing path usually merges back in with the organically clean data prior to the load to the destination.
As shown in the flowchart above, this pattern keeps the known-good data and the suspect data in parallel pipelines, bringing them back together once the deficiencies have been cleaned up. This pattern assumes a row-level granularity.
Direct data cleansing is my preferred method for handling suspect data. However, I don’t want to imply that data cleansing is a simple process. Defining the business rules and technical parts to detect and cleanse data usually requires a significant time investment, especially when those business rules are complex.
Most ETL systems have built-in logic to handle some of the more common data cleansing operations, as well as scripting interfaces to allow you to roll your own custom data quality solution. For more complex needs, a third-party solution could be the right way to go. For example, you could write your own address verification logic, but creating that logic AND maintaining an accurate database of addresses would likely cost more time and money than using a third-party vendor already equipped for such operations. Do pay attention to the performance implications and data volume limits when using a vendor for data quality operation, as most of these use a cloud service and license their software based on data volume.
When building an in-line data cleansing architecture for managing bad data in ETL, keep the following design points in mind:
- Because this process will modify data in flight, be sure to keep an audit trail of cleansing changes.
- Will there be cases where the data cannot be fixed? If so, create a safety valve to triage (or delete, if appropriate) unresolvable bad data.
- Because each suspect row of data will be handled individually, be aware of potential performance bottlenecks if the cleansing operation requires a lot of system resources.
The triage design pattern is useful for data quality issues that cannot be resolved in-line. Here, the suspect rows are left intact and are written to an alternate location to be dealt with in a separate process. The ETL process becomes a two-output system, where good data goes to its typical destination while suspect data is written to the triage output.
Although this flowchart indicates a simpler design than the in-line cleansing architecture, this only shows part of the solution. Rarely is the suspect data left in triage permanently; in most cases, there are additional processes – or perhaps even human intervention – for remediating triaged data and loading it to the final destination.
This design pattern is useful for cases where the remediation is too complex to be modeled in a cleansing business rule, or when the business rules are subjective and require human intervention to approve any data changes. Additionally, this pattern can be helpful when some of the business rules depend on aggregations of the entire set of data which would be difficult to handle in-line.
A few things to keep in mind when using a triage output to collect suspect data:
- Be sure to thoroughly test the process(es) that interact with the triaged data. It’s not enough to send suspect data to triage, so a unit test of the above architecture is not enough.
- When building those peripheral processes that cleanse the data in triage, remember to eject from triage any rows cleaned up and written to the final output. Otherwise, the triage table will continue to grow and performance will eventually suffer.
- Because this pattern uses two (or potentially more) outputs, it is possible to inadvertently lose data in the ETL pipeline if one of the outputs was not properly wired up. It is highly recommended to use a row count audit to make sure that all incoming rows of data are accounted for in one of the outputs.
I hesitate to even bring up this design pattern for reasons I’ll enumerate shortly. However, in the interest of comprehensive coverage on the topic, I’ll briefly describe the pattern of deleting bad data. In short, this design pattern looks very similar to the triage pattern described above, except that bad data is simply discarded.
As shown in the above flowchart, only the good data is retained. Rows failing the data quality checks are sent to the bit bucket.
On the positive side, this design pattern has the distinction of being the easiest to develop. The downside is – and I can’t emphasize enough the gravity of this point – you’re going to lose data. Not just the data – you also lose any audit trail showing that the data was processed but rejected. In addition, you will have lost the ability to test the logic for managing data quality using historical bad data. In some edge cases, this may be acceptable. You should consider this design pattern for processing suspected bad data only if all of the following are true:
- There is no practical way to resolve the data deficiency causing the records to fall into an error state
- There is no business value in having any of the data in any column or field in a row of data that failed the data quality checks
- The removal of one or more records does not impact the validity or completeness of the imported data
- There are no processes that will require auditing in this set of data
- There are no plans to change the criteria for what defines a good vs. bad row of data
Personally, even if all of the above statements are true, I’m still going to resist simply deleting those suspect rows. I hate losing data, even if it’s on purpose. If nothing else, keeping the bad data in a dump table or file can be useful for testing the data quality evaluation logic.
In reality, most pipelines for managing bad data in ETL using an amalgamation of the above design patterns. Frequently I see the in-line cleansing used with the triage pattern, in which data that cannot be cleansed in-line is sent to a triage table or file. Further, in an ETL project consisting of many discrete load processes, the handling of suspect data can and will vary from one load to the next.
I briefly mentioned this earlier, but it is important enough to repeat. When building a process that manages bad data, be sure that you capture a digital audit trail when manipulating or filtering data before loading it to its final destination. This chain of evidence is critical for troubleshooting, data lineage documentation, and occasionally, CYA purposes.
Managing bad data in ETL is a requirement in many enterprise ETL projects. Data can be cleansed in the ETL, stored in triage, or simply discarded when found to be deficient. Above all else, remember that the processing of suspect data should meet the data needs of the business.
Tim, there is also the “recycle” architectural pattern for advanced data cleansing that requires human intervention by data stewards. In this scenario, the business has determined that at least some of the time they cannot fix the data in the source so they want an interface to enable manual data fixes that override incoming data. This pattern allows for routing bad data to the recycle subsystem and then integrating fixes from the steward back into the incoming dataflow as an overriding value. The system should provide for a data steward to designate the problem data as either going to be fixed in the source or providing the overriding value. This should be coupled with data steward notifications and DQ reporting as tools to enable stewards to stay on top of the issues. MDS (Microsoft Master Data Services) could be used to provide an effective data steward interface. DQS could also provide that function however I have had significant performance issues with my testing of DQS as well as questions about long term support by Microsoft for DQS so I would not recommend it.
Hi Gary, thanks for the comment. Yes, I’ve used the recycle pattern both in practice and in some of my courses. I consider that to be a subpattern of the triage architecture, since we’re separating the good data from the bad and moving them into separate outputs. And I share your concerns about the future of DQS, since it hasn’t had a significant update since its release in 2012.