Earlier this week I wrote about the basics of change tracking in SQL Server, and showed how to get started using this technology for change detection. In this post, I’ll continue what I started by demonstrating how change tracking fits into a larger design pattern for end-to-end incremental load ETL processes.
ETL processes fall into one of two broad categories: full loads or incremental loads. In a full ETL load, the target tables are truncated (or even dropped) every time the load process runs. While this is sometimes an effective load strategy – especially for smaller loads – a more common approach is the incremental load. In an incremental load, only the new and updated (and occasionally, the deleted) data from the source is processed. Although there are valid data load scenarios for both full and incremental loads, using the latter is generally preferable to the former. Some of the benefits of incremental loads are:
- Faster load times due to a smaller amount of data being processed
- Less database contention (again, because of a smaller scope of data)
- If the source data is no longer available, the previously loaded data in the destination will be unaffected
The one big requirement for doing incremental loads is that there must be a reliable mechanism for detecting the new and changed data in the source. There are several ways to do this depending on the source type, size and shape of data, and the ETL tool being used. This is where change tracking for SQL Server frequently fits in well. When designing a solution to detect deltas for incremental loads in SQL Server source data, change tracking is often the most effective way to isolate the new and changed data from that which has not been changed since the last load.
Requirements for Change Tracking
There are few prerequisites for using change tracking as a means for delta detection:
- The source data must be in SQL Server 2008 or later
- Each table to be monitored in change tracking must have a primary key
- Change tracking must be enabled at the database level (this requires Alter Database permission) before adding any tables
Although the list of requirements is fairly short, it is the last requirement in the list above that is most likely to rule out the possibility of using change tracking. Change tracking runs on the source system, and ETL access to those source systems is often limited to simply read operations (especially if the source data is coming from a vendor or other third party). Running any command to alter the database, even a relatively small change such as enabling change tracking, is often restricted and requires multiple layers of approval – if it is permitted at all. Now, don’t let this dissuade you from using change tracking, especially if your source system is a SQL Server instance under your organization’s control. I have not run into any issues where simply enabling change tracking has caused problems on source systems, and the benefits of change tracking far outweigh the minor amount of risk involved in enabling it.
Change Tracking for Incremental Loads
In the last post I showed how to enable change tracking, including a brief example of how change tracking works for insert, update, and delete operations. If you’ve not worked with change tracking before, I’d encourage you to read that post for an overview of how this technology works. I’ll now continue down that path by sharing a design pattern I like to use when change tracking can be used.
In this design pattern, there are three critical components:
- Change tracking on the source table(s)
- A mechanism to store the version number from one load to the next
- Logic in the load process to handle new and changed (and possibly deleted) records when loading to the destination
I demonstrated in my last post how to set up change tracking, so I’ll not repeat that information here. I’ll start with the second component above: the persistence of the version number.
Storing the ETL Version ID
The ETL version ID is the timeline marker used to indicate the last point in the change tracking version sequence when the ETL process ran. This design pattern uses a user-defined table (I called mine [etl].[Change_Tracking_Version]) to store the last version number processed by the most recent ETL load. For the sake of clarity in this blog post, I want to make certain I draw a distinction between this stored version number (which I refer to as the ETL version ID), and the current version number reported by change tracking (the change tracking version ID). The former is the value I’ll store in the above mentioned table explicitly created for the purpose of storing ETL metadata; the latter is the value returned from change tracking when I call the CHANGE_TRACKING_CURRENT_VERSION() function.
Every time the load process runs, it will first retrieve the ETL version ID of the last load, storing that value in a T-SQL variable I’ve named @StartVersionID. Next, the process will retrieve the current change tracking version ID (using the CHANGE_TRACKING_CURRENT_VERSION() function), writing that to a variable named @EndVersionID. Those two variables represent the start and end version IDs for this load. After the data load is complete, the @EndVersionID is now the most recent version number that has been processed in ETL, and that value will be written back to the [etl].[Change_Tracking_Version] table. This will be the starting point for the next load.
The table used to store the ETL version ID can be defined with as little or as much information as desired. In this example, I use a simple, two-column table: one column with the name of the table to be tracked, and the other a BIGINT column to store the version ID. I set the table name column to be used as primary key, because I never want to have multiple rows for the same tracked table in this version. I use a separate schema for ETL metadata, as shown below.
It might look unusual to use a MERGE statement instead of a simple insert. The reason for this is that I’ve configured the [etl].[Change_Tracking_Version] to store exactly one version ID value for each table. If there is already a value for the specified table, it should be updated; otherwise, a new row will be inserted. For each execution of the ETL process in the future, I’ll run a MERGE statement similar to the above to store the new ETL version ID (which will be used as the starting point for the very next load). The newly inserted value in [etl].[Change_Tracking_Version] is shown below.
The ETL process(es) that load the [Emp] will be configured to query [etl].[Change_Tracking_Version] at the beginning of each load to retrieve the version ID to start from. Each process will also retrieve the current change tracking version ID (using CHANGE_TRACKING_CURRENT_VERSION()) which will provide the ending version ID for that load.
With both of those values in hand, the source query against the [Emp] table would include the ETL version ID (passed to CHANGETABLE() as a parameter) and the ending change tracking version number (added to the WHERE clause). This gives us a fixed ending point for the duration of the ETL process, and would eliminate the possibility of rows “sneaking in” while the ETL is running. The source query against the [Emp] table would be configured as shown below.
In the above statement, I’m selecting only the new and changed data from the source table. I’m retrieving all of the records since the last ETL load, by passing in the last ETL version ID to the CHANGETABLE() function. I’m also filtering that result set in the WHERE clause by only including records with a version ID less than or equal to the value reported by CHANGE_TRACKING_CURRENT_VERSION() in the previous snippet. You may remember from the last post that the change tracking columns SYS_CHANGE_VERSION and SYS_CHANGE_CREATION_VERSION might each contain a different version number (if the record was inserted and then updated, for example), so to mark the ending point of the load I’m filtering at the row level by the higher of those two values.
Also regarding the query above, I’ll also briefly point out that I’m querying from the CHANGETABLE() function and performing a left join on the [Emp] table. The reason I built the query this way is that I want to see inserted, updated, and deleted records in this result. Because the deleted data would no longer exist in the [Emp] table, performing an inner join operation between these two would not return any hint of deleted data. However, by querying with the left join as shown above, the [ID] values for deleted rows will be included (from the call to CHANGETABLE()), allowing me to delete or expire the corresponding rows in my destination table.
Handle the New and Changed Data
Up to this point, the code I have shown will retrieve the last ETL version ID, and returns the full record for inserts and updates along with the ID values for deletes. But how do we use this data? There are myriad ways to handle this when loading to the destination tables, but one of my favorites is the merge operation (not unlike the simple one I created above for managing the ETL version ID). For this example, I’ll use a dimension table – [dw].[DimEmp] – as the destination for this load. Shown below is the structure and existing data in that table.
As of now, no changes have been made to the [Emp] table since I stored the ETL version ID in the [etl].[Change_Tracking_Version] table. To create a working set of data, I’ll modify some data in that table.
Now I’ll rerun the query to return all of the changes to [Emp].
This query now shows the four changes – two updates, one insert, and one delete – that I made to the [Emp] table.
To update this table, I take the most recent query from above, and wrap it in a merge statement. The select operation becomes the source query for the merge, and I’ll conditionally process the insert, update, and delete operations therein. Notice that this query doesn’t delete records in the destination when they are deleted in the source; rather, a bit flag called [Active] set to 0 indicates a soft-deleted record.
As shown, this code will perform a T-SQL MERGE against the [dw].[DimEmp] table, inserting new records and updating any records that were either modified or deleted from the source table [dbo].[Emp]. After executing this merge operation, the updated [dw].[DimEmp] table looks like this.
The record for EmpID = 7 was deleted in the source, and the merge operation properly marked this one as inactive. EmpID values 8 and 10 were updated in the source, and show the updated values in the merged data. The new row from [dbo].[Emp] – showing an EmpID of 13 – was not previously in the [dw].[DimEmp] table and was inserted. For each of these changes, the UpdatedDate was also updated; the rows with EmpIDs of 9 and 12 were not changed and therefore have the original UpdatedDate value.
Once the load is run, the only thing remaining is to update [etl].[Change_Tracking_Version] to reflect the new ETL change version, which will be the starting version for the next load. Using a merge statement similar to the one I used to originally populate this table, I’ll now update the value using the @EndVersionID value.
Although a bit of a long write-up, this post demonstrates an effective design pattern for using change tracking in an incremental load process.
In my next post, I’ll address this topic from the SSIS side, and will show how change tracking can be used in Integration Services packages to perform cross-server change detection.