Recently, I wrote about how to get started with SQL Server change tracking, and I demonstrated a design pattern I use with change tracking in incremental load scenarios. In this post, I’ll round out the topic by showing how using change tracking in SSIS packages can add more flexibility to ETL processes.
Using Change Tracking in SSIS
In my last post I showed how to do a complete load, including querying the changed data and performing the inserts and updates on the target table. However, this design had one significant limitation: it assumed that the source and the destination databases were on the same SQL Server instance. The only way to make such a pattern work with multiple servers is to use linked servers, and my regular readers know that I hate using linked servers more than I hate canned sardines. To help avoid such a smelly solution, an ETL tool such as SSIS can broker the data exchange for incremental loads.
Although SSIS has no direct support for change tracking, the feature can still easily be used in SSIS packages. Because change tracking uses T-SQL, SSIS packages see change tracking functionality as just another database command or result set. In the example below, I’ll show how to take the script-based example from the previous post and wrap it in an SSIS package, which can easily use different SQL Server instances for the source and destination.
The Incremental Load
To keep things simple, I’ll use the same data structures from the last example. As a brief review, my OLTP source table is named [dbo].[Emp], and has the following structure and data.
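For readers who don’t have the previous post handy, a minimal sketch of such a source table might look like the following. The name columns are hypothetical stand-ins for the actual schema; what matters for this pattern is that change tracking requires a primary key and must be enabled on the table.

```sql
-- Hypothetical sketch of the source table; column names beyond the key
-- are assumptions, not the actual schema from the previous post.
CREATE TABLE [dbo].[Emp]
(
    [EmpID]     INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- CT requires a PK
    [FirstName] NVARCHAR(50) NOT NULL,
    [LastName]  NVARCHAR(50) NOT NULL
);

-- Change tracking must be enabled on the table for this pattern to work
ALTER TABLE [dbo].[Emp] ENABLE CHANGE_TRACKING;
```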
The data warehouse table to be loaded by this process is named [dw].[DimEmp] and looks like this.
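A rough sketch of the dimension table, again with hypothetical name columns; the [Active] and [UpdatedDate] columns are the ones this load pattern relies on for soft deletes and change timestamps.

```sql
-- Sketch of the destination dimension table (column names are assumptions)
CREATE TABLE [dw].[DimEmp]
(
    [EmpID]       INT NOT NULL PRIMARY KEY,
    [FirstName]   NVARCHAR(50) NOT NULL,
    [LastName]    NVARCHAR(50) NOT NULL,
    [Active]      BIT NOT NULL DEFAULT (1),  -- set to 0 on soft delete
    [UpdatedDate] DATETIME2 NULL             -- stamped on each change
);
```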
This design pattern also uses a control table named [etl].[Change_Tracking_Version] which stores the change tracking version ID from the last time the ETL process was run. This table resides on the same SQL Server instance as the source table, and is updated with a new change tracking ID at the end of every ETL operation.
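The control table can be as simple as the following sketch; the column names here are my assumptions, and the table could hold one row per tracked table if the process grows beyond a single source.

```sql
-- Sketch of the control table that stores the last-used CT version ID
CREATE TABLE [etl].[Change_Tracking_Version]
(
    [Table_Name]              NVARCHAR(128) NOT NULL,
    [Change_Tracking_Version] BIGINT NOT NULL
);
```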
Finally, I’ll add one more table to make it easier to separate the source from the destination. This new table lives on the destination SQL Server instance, and will be used for staging the update operations.
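A sketch of that staging table follows. It mirrors the destination table, plus an [Operation] column so the later set-based statements can tell updates from deletes; the name columns allow NULLs because deleted rows arrive from change tracking with only their key.

```sql
-- Sketch of the staging table on the destination instance
CREATE TABLE [dwstage].[DimEmp]
(
    [EmpID]       INT NOT NULL,
    [FirstName]   NVARCHAR(50) NULL,   -- NULL for deleted rows
    [LastName]    NVARCHAR(50) NULL,
    [Active]      BIT NOT NULL,
    [UpdatedDate] DATETIME2 NULL,
    [Operation]   NCHAR(1) NOT NULL    -- 'U' for update, 'D' for delete
);
```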
Building the Package
Just like the T-SQL script version of this operation, the solution in SSIS will execute the following steps:
- Retrieve the last ETL version ID from the control table [etl].[Change_Tracking_Version], and store that value in a variable
- Query the current change tracking version, which will mark the ending point of this load, and store it in a variable
- Using change tracking functions, query the [dbo].[Emp] table from the source. The two variables mentioned above will be used in the query to set the version ID boundaries of the load
- Insert or update data in the destination table [dw].[DimEmp] based on the results from the above step
- Upon success of the [dw].[DimEmp] load, update the value in [etl].[Change_Tracking_Version] with the variable value used as the ending point for this load. That value will be the starting point for the next load
Because this updated methodology will allow for source and destinations on different SQL Server instances, there will be a few minor changes for this design pattern:
- SSIS, not pure T-SQL, will orchestrate the load
- The starting and ending point version IDs previously stored in T-SQL variables will now be kept in SSIS variables
- Insert and update operations, previously integrated in a single T-SQL merge statement, will now be separated
- A new staging table (for which the DDL appears above) will be used to stage rows for update, to allow for set-based operation
The first step in building the SSIS package is to set up the connections. In this case, there are two: one to the source OLTP data, and another to the destination instance. For this example, I’ve named the source connection OLTP and the destination connection DW.
Next, there are two SSIS package variables required to store the version IDs.
Populating these starting and ending version IDs can easily be done with a parameterized query in a SQL task. As shown below, I’m querying the [etl].[Change_Tracking_Version] table to get the last ETL version ID for this table.
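A query along these lines would do the job; this is a sketch assuming a single-row control table (add a WHERE clause on the table name if the control table tracks multiple sources).

```sql
-- Sketch: retrieve the stored version ID into an output parameter.
-- The ? placeholder is an SSIS parameter, configured on the task's
-- Parameter Mapping tab with a direction of Output.
SELECT ? = [Change_Tracking_Version]
  FROM [etl].[Change_Tracking_Version];
```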
On the Parameter Mapping tab of this task, I’ll map the parameter value (shown above with a placeholder of “?”) to the @vStartVersionID variable.
This same process is repeated in a different SQL task, with a slightly different query to retrieve the current change tracking version number to use as the ending marker for this load. The output from the query below will be written to the @vEndVersionID variable.
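The ending marker comes from the CHANGE_TRACKING_CURRENT_VERSION() function, so the query in this second SQL task can be as simple as the following sketch.

```sql
-- Returns the current change tracking version for the source database;
-- the ? output parameter is mapped to the ending-version SSIS variable.
SELECT ? = CHANGE_TRACKING_CURRENT_VERSION();
```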
To keep things tidy, I’ve put both of these SQL tasks in a sequence container. No precedence constraints are needed yet, since neither task depends on the other (and it’s probably a tiny bit faster, since both queries can run in parallel). I’ll also add one more SQL task, which will truncate the [dwstage].[DimEmp] table. This new SQL task will execute the following statement.
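The truncation statement is a one-liner:

```sql
TRUNCATE TABLE [dwstage].[DimEmp];
```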
This task to truncate the staging table will be set to run after the previous two tasks. So far, the package looks like this.
Next, I’ll create a data flow task to process the new and changed data from the source table. In that data flow I create an OleDB source, using the OLTP connection created above. The query in this source looks almost identical to the one used in the T-SQL example in the last post. The only difference in this pattern is that the query uses SSIS variables for the start and end version IDs – the same variables populated in the previous steps.
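Here is a sketch of that source query, using the hypothetical name columns assumed earlier. The left join brings in the current column values for inserted and updated rows, while deleted rows carry only their key from the change tracking side.

```sql
-- Sketch of the OLE DB source query. The two ? placeholders are mapped
-- to the start and end version SSIS variables via the Parameters button.
SELECT ct.[EmpID],
       e.[FirstName],
       e.[LastName],
       ct.SYS_CHANGE_OPERATION AS [Operation]
  FROM CHANGETABLE(CHANGES [dbo].[Emp], ?) AS ct
  LEFT JOIN [dbo].[Emp] AS e
         ON e.[EmpID] = ct.[EmpID]
 WHERE ct.SYS_CHANGE_VERSION <= ?;
```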
The two literal question marks in the query above represent placeholders for the version ID parameters. Clicking the Parameters button reveals the Parameters substitution window, where I map the SSIS variables to those params.
Like most data flow operations, this one uses a derived column transformation to apply minor changes to the data. In this case, I’m translating a “D” value for the Operation field (representing a delete) into a 0 (False) value for the [Active] field. I’ve also added the [UpdatedDate] field, set to the current date and time.
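In SSIS expression syntax, those derived column expressions would look roughly like this (a sketch, not the exact expressions from the package):

```
Active      : [Operation] == "D" ? FALSE : TRUE
UpdatedDate : GETDATE()
```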
So far it’s a fairly simple data flow. (Don’t worry – I’ll add a bit of complexity in the next step.)
You may recall that I mentioned this pattern will separate the inserts from the updates. This is that fork in the road. By adding in a conditional split to filter on operation – with Inserts sent one direction, Updates and Deletes in another – I create two different output paths. The conditional split transformation is shown below.
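The split condition itself is a one-line SSIS expression on the Inserts output, with everything else falling through to the default output:

```
Inserts : [Operation] == "I"
```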
Rows with an Operation value of I (for Insert) are sent to the output named Inserts. Any other rows are assigned to the default output, which I named Updates and Deletes, since U and D are the only other possible values in the Operation field. From here, the rows sent to the Inserts output go directly to the destination table ([dw].[DimEmp]) through an OleDB destination. For rows in the Updates and Deletes path, a second OleDB destination writes those records to the staging table ([dwstage].[DimEmp]) created a few paragraphs ago.
That’s it for the data flow task. The final layout looks like this.
Now, the new rows have been loaded, but the updated and deleted rows are still waiting in the staging table. I’ll apply these using a simple SQL task. Although this could be done in a single statement, my design pattern uses separate queries for updates and deletes. As shown below, the pertinent values are updated depending on whether the row change operation from the source was an update or a delete.
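A sketch of those two statements follows, again using the hypothetical column names assumed earlier.

```sql
-- Apply updates from the staging table to the dimension table
UPDATE d
   SET d.[FirstName]   = s.[FirstName],
       d.[LastName]    = s.[LastName],
       d.[UpdatedDate] = s.[UpdatedDate]
  FROM [dw].[DimEmp] AS d
  JOIN [dwstage].[DimEmp] AS s
    ON s.[EmpID] = d.[EmpID]
 WHERE s.[Operation] = N'U';

-- Soft-delete rows that were deleted from the source
UPDATE d
   SET d.[Active]      = 0,
       d.[UpdatedDate] = s.[UpdatedDate]
  FROM [dw].[DimEmp] AS d
  JOIN [dwstage].[DimEmp] AS s
    ON s.[EmpID] = d.[EmpID]
 WHERE s.[Operation] = N'D';
```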
The final step is to update the [etl].[Change_Tracking_Version] table. I’ll use the query below in a new SQL task, and will substitute the literal question mark with the value from the @vEndVersionID SSIS variable.
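That control table update might look like the following sketch, assuming a single-row control table.

```sql
-- The ? placeholder is mapped to the @vEndVersionID SSIS variable;
-- this value becomes the starting point for the next load.
UPDATE [etl].[Change_Tracking_Version]
   SET [Change_Tracking_Version] = ?;
```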
With all of the tasks and components in place, the final package control flow looks like the following.
The package is ready to run as-is. However, the first run will be anticlimactic, because nothing has changed in the source! Change tracking does exactly what it should: it returns zero rows, because no inserts, updates, or deletes have taken place since the last ETL load. To force some data for this example, I’ll run the change script I used in the last post.
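I don’t have that script in front of you here, so the following is a hypothetical stand-in that produces the same mix of changes: one insert, two updates, and one delete against the source table. The specific names and key values are invented for illustration.

```sql
-- Hypothetical stand-in for the change script from the previous post
INSERT INTO [dbo].[Emp] ([FirstName], [LastName])
VALUES (N'Grace', N'Hopper');                               -- one insert

UPDATE [dbo].[Emp] SET [LastName]  = N'Smith-Jones' WHERE [EmpID] = 2;
UPDATE [dbo].[Emp] SET [FirstName] = N'Robert'      WHERE [EmpID] = 3;  -- two updates

DELETE FROM [dbo].[Emp] WHERE [EmpID] = 4;                  -- one delete
```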
After running the above script containing insert, update, and delete operations, I’ll run a test on the package. When I peek at the data flow, I see that the package properly detected new and changed rows, and sent them out to different outputs as designed.
A total of four rows were changed: one insert, two updates, and one delete. The data flow properly sent the one inserted record directly to the destination table, and the three update/delete rows were routed to the staging table. A quick query against the destination table [dw].[DimEmp] shows that all four records were properly handled: the new row appears, two rows have updated values (as well as new UpdatedDate stamps), and the record deleted from the source has been soft deleted.
This post describes how using change tracking in SSIS can make the process of incremental loads much easier. As shown, the same change tracking patterns I discussed in the first two posts in this series can easily be extended to SSIS packages as well. Although there are no tasks or components specific to SQL Server change tracking, this functionality can be used in Integration Services packages through plain T-SQL. Using SSIS packages in conjunction with change tracking opens up many new possibilities for incremental loads, especially when the source and destination are on different SQL Server instances.