Before I began my technical career over a decade and a half ago, I spent several years working in law enforcement. In that field, one of the things one must learn quickly is the concept of the chain of custody of evidence. There were numerous procedures we had to follow to ensure that evidence was not just gathered and preserved, but fully documented as to when and where it was collected, who took possession, when, and for what reason. These processes, while rigid and time-consuming, helped to build a necessary trail of documentation to protect both the evidence and those in possession of it. While it was not common that the integrity of the evidence was called into question, having a well-documented chain of custody would eliminate ambiguity about where the evidence had originated and who might have had access.
I find a parallel between the evidentiary chain of custody and the need to capture the origin of the data during movement and transformation processes. Data can be handed off from one system to another, or even between processes in the same system, many times during a single load cycle. Data can arrive from multiple systems at the same time, often occurring rapidly and in parallel. In some cases, the ETL load itself may generate new data. Through all of this, we still have to be able to answer two fundamental questions: Where did this data come from, and how did it get here?
In the data world, the design pattern of ETL data lineage is our chain of custody. In this ETL Best Practices tip, I’ll discuss the importance of ETL data lineage and will demonstrate some design tips on adding this to your new and existing processes.
ETL Data Lineage
Why is data lineage important?
- It builds trust in the data by making clear the origins of data
- Simplifies the troubleshooting process by allowing the tracing of data at the row level
- Reduces the risk of ETL data loss by making “holes” in the process more apparent
- Enables greater visibility of business rules that would otherwise be buried in an ETL load process
The concept of data lineage is relatively simple: Build ETL processes in such a way that a single row of data can be tracked back to the source from where it came and how it came to arrive here. Simple enough as a design pattern, right? However, simple does not mean easy, and as a result I find that most ETL processes do not establish clear data lineage. Properly building and testing the elements of tracking data lineage takes time and effort. It requires provisions in the data itself – some key or keys to allow tracking the data from original source to final destination – as well as thorough documentation of the processes it passes through to get there. In most cases, establishing data lineage also includes auditing of changes that take place during the various stages of ETL.
Relative vs. Absolute Data Lineage
There are two options for how to build ETL data lineage: either track each row all the way back to its origin using the first available key value (absolute lineage), or for an ETL process with multiple steps, set up each table so that every row will have a pointer back to the key from the table that immediately preceded that load process (relative lineage). The former pattern is better suited for simpler setups, where there are only one or two ETL transformation phases (or gates) and the progression of data from one gate to another is clear and unambiguous. However, the pattern of absolute data lineage is easily outgrown, requiring a graduation to a relative data lineage model.
As shown in the model above, the relative data lineage model is a bit more complex, requiring a walk through multiple gates to determine the origin of a particular row. However, the relative data lineage model also scales much more easily, and simplifies the process of tracking atypical load patterns (such as those where one of the steps in the ETL processes can generate new data without a corresponding row from the source).
What Does Data Lineage Look Like?
Let’s set up a concrete example of this. Familiar scenario: A client sends data to us using Excel. A snippet of this data file is below.
A quick peek reveals some obvious patterns in this data:
- There are two different grains of data represented in this file: patient demographics and transaction details
- There is row-level unique key. PatientID is specific to the patient, but we have duplicated values for patient “Wayne, Bruce”
- It is dirty. Names appear in two different formats.
- It is incomplete. Quantity is missing on one line, and another is missing Line Total and New Balance.
- It has a John Doe record. Perhaps this will require special handling?
One of the fundamentals of data lineage is having one or more fields that ensure row-level uniqueness. In this case, we do not have any such fields from the source, so we will need to manufacture our own in this case. Even in cases where a field (or combination of fields) appears to provide row-level uniqueness, I recommend that you resist the temptation to use the natural unique key(s) for data lineage purposes. Although it could save an extra column by skipping the surrogate unique key, there are a number of cases where using natural keys could cause problems: populating the same table from multiple sources with potentially overlapping values; inadvertent duplicate key values where none are expected by the ETL process; and type 2 slowly changing dimensions where business key duplication is expected by design. It takes a bit more effort and some additional storage space to use a surrogate unique key, but the benefits far outweigh the costs.
When creating table-level row identifiers, in most cases a whole number (either a 4- or 8-byte integer) works best. To land this data in a database table, I would create an additional column as an automatically increasing integer value – referred to as an identity value in most database systems – to provide a row-level anchor for uniqueness and data lineage tracking. Below, I show the staging table used to land the data from the file above. Note that I have made no modifications to the schema at this point, save for the addition of this new identity column (SourceRowID), and the ETLID column to track the ID of the ETL operation that processed the load (more on this later).
The new column of SourceRowID will now be used for data lineage purposes.
In most enterprise ETL systems, data will go through multiple gates to reshape, clean up, and standardize the data. In the case of the data sample above, the first transformation gate will be to separate the two grains of data – patient and charge item – into separate output tables. The charge item output is easier, because the data sample shown translates to exactly one charge item per row of data. Simply grabbing those items specific to the charge items along with the newly added data lineage metadata results in a charge item output table similar to the example below.
As shown, moving the charge item data to its own table preserves the rows specific to the charge line items as well as the SourceRowID value generated when the data landed in the original staging table above. In this table I also added a ChargeItemKey identity field which uniquely identifies each charge item. The ChargeItemKey field is not mandatory for this design, but I typically use an identity column in each new table in a gated ETL design – especially when the data is being reshaped when passing through that gate. Using this pattern of creating a new surrogate ID for each table also supports the flexible pattern of relative data lineage mentioned above.
Dealing with patient information from the original data file is a bit more complex than the charge items. Because there are known duplications in the data, moving data into a table with granularity at the patient level will require some deduplication. There are many methods to accomplish deduplication, so I’ll save the specifics of that operation for a different day. After deduplication, we end up with a patients staging table that looks similar to this.
As before, the grain-specific fields are passed through to the new patients table. the SourceRowID is retained as a pointer back to the original data, retaining the data lineage. Because this gate also managed the deduplication of patient records, the output is five rows, not the six shown in the original. Note that the deduplication process in this example does not explicitly track the patient record for SourceRowID 2 (the duplicate of patient “Wayne, Bruce”), since the patient-level data was rolled up into the row marked with SourceRowID of 1. In some ETL data lineage designs, it may be necessary to track the rows that are purposefully discarded due to deduplication. This is where a good ETL audit strategy can be helpful, by tracking those rows sacrificed to deduplication and other business rules.
This relatively trivial example shows a brief summary of a design in which a unique row identifier is manufactured and retained throughout two stages of ETL. Obviously, the state of this data would most likely change further: data cleanup would be performed on the format of the name column, the null numeric columns would likely need to be cleaned up, and the John Doe record has some deficiencies (no AccountStatus or ChargeEvent values) that may require either manufacturing a value or redirecting the entire row to triage. Whether this process has two gates or twenty, the same ETL data lineage pattern would continue to hold true: each row should be traceable back to its origin either directly (through absolute lineage) or indirectly (using relative lineage).
ETL Process Tracking
When implementing data lineage, I almost always add ETL process lineage IDs as well. These process IDs – represented in the above example in the ETLID column – identify the instance of the ETL process execution that loaded the data. As shown above, these IDs are different from one table to another, but are duplicated in each table. This is by design; all of the rows inserted or updated in a given table in the same ETL cycle would share an ETL ID value, and those ETL IDs are specific to each table load in most cases. Keeping track of row-level lineage as well as ETL operation IDs together help to create an electronic trail showing the path that each row of data takes through the ETL pipeline.
Do all ETL processes require data lineage tracking? No. Smaller, less complex ETL processes might not require the same level (if at all) of lineage tracking that would be found on a large, multi-gate data warehouse load. More gates and/or transformations that take place during the ETL will increase the need for full data lineage tracking. When in doubt, my recommendation is to spend the extra time to build ETL data lineage into your data pipeline. It’s better to have it and not need it than the reverse.
ETL data lineage tracking is a necessary but sadly underutilized design pattern. Proper data lineage identification helps to build a more solid and trustworthy ETL process that is easier to audit, simpler to troubleshoot, and more clear in its operation.