When moving data in an extraction, transformation, and loading (ETL) process, the most efficient design pattern is to touch only the data you must, copying just the data that was newly added or modified since the last load was run. This pattern of incremental loads usually presents the least amount of risk, takes less time to run, and preserves the historical accuracy of the data.
In this post, I’ll share what an incremental load is and why it is the ideal design for most ETL processes.
What is an Incremental Load?
An incremental load is the selective movement of data from one system to another. An incremental load pattern will attempt to identify the data that was created or modified since the last time the load process ran. This differs from the conventional full data load, which copies the entire set of data from a given source. The selectivity of the incremental design usually reduces the system overhead required for the ETL process.
The selection of data to move is often temporal, based on when the data was created or most recently updated. In some cases, the new or changed data cannot be easily identified solely in the source, so it must be compared to the data already in the destination for the incremental data load to work properly.
Why use an Incremental Load?
Using an incremental load process to move and transform data has several benefits and a few drawbacks as well.
Benefits of Incremental Data Loads
Incremental data loads have several advantages over full data load.
They typically run considerably faster since they touch less data. Assuming no bottlenecks, the time to move and transform data is proportional to the amount of data being touched. If you touch half as much data, the run time is often reduced at a similar scale.
Because they touch less data, the surface area of risk for any given load is reduced. Any load process has the potential of failing or otherwise behaving incorrectly and leaving the destination data in an inconsistent state. Fractional load processes will add or modify less data, reducing the amount of data that might need to be corrected in the event of an anomaly. Data validation and change verification can also take less time with less data to review.
Incremental load performance is usually steady over time. If you run a full load, the time required to process is monotonically increasing because today’s load will always have more data than yesterday’s. Because incremental loads only move the delta, you can expect more consistent performance over time.
They can preserve historical data. Many source systems purge old data periodically. However, there may still be a need to report on that data in downstream systems. By only loading new and changed data, you can preserve in your destination data store all the source data, including that which has since been deleted from its upstream source.
Challenges of Incrementally Loading Data
Incremental data loads are usually the preferred way to go; in fact, this design pattern is on my inventory of ETL Best Practices. However, there are some things to be aware of when considering this pattern.
Incremental logic tends to be more complex. We sometimes refer to a full load as a “dumb load”, because it’s an incredibly simple operation. A full load takes all the data in structure X and moves it to structure Y. With incremental loads, the developer must add additional load logic to find the new and changed data. That added complexity can be minimal or it can be very significant.
There’s not always a clear way to identify new and changed data. If you can’t easily identify in the source the changed data, you’ll have to choose from one of several not-as-good options for managing the incremental payload.
When to Use This Design
Part of being a good data engineer is knowing when to use which load type. The decision to use an incremental or full load should be made on a case-by-case basis. There are a lot of variables that can affect the speed, accuracy, and reliability of a load process, so you shouldn’t assume that the solution that worked last time is going to be a perfect fit in the next one.
Here are a few factors which favor toward using an incremental pattern of loading:
- The size of the data source is relatively large
- Querying the source data can be slow, due to the size of the data or the technical limitations
- There is a solid means through which changes can be detected (more on that in a moment)
- Data is occasionally deleted from the source system, but you want to retain deleted data in the destination (such as a data warehouse or an auditing system)
On the opposite side of that, some payloads are more favorable to performing a full load:
- The size of the data is very small
- There is no easy way of determining which data is new or changed in the source
- Historical data can be changed, even from months or years in the past
- You need the destination to look exactly like the source
How to Implement This Design
For loading just new and changed data, there are two broad approaches on how to handle this: source detection or destination comparison. I’ll review a few examples of each of these.
Source Change Detection
Source change detection is a pattern in which we use the selection criteria from the source system to retrieve only the new and changed data since the last time the load process was run. This method limits how much data is pulled into the ETL pipeline by only extracting the data that actually needs to be moved, leaving the unchanged data out of the load cycle. The less data we have to touch, the less time it takes to process it.
Here are a few of the ways you can implement source change detection.
Using row-level insert or update dates. This is the simplest way to do change detection on the source. In this method, each table has a column to store the date the data was loaded and another for the date on which it was modified. Each row can then self-identify as to when it was changed. However, I have found that this method can be unreliable, especially in third-party systems over which you have no control. I’ll expand on my caution for this method in another post in the future.
Using change tracking. Change tracking is available in many popular relational databases (RDBMS), and is my favorite way of handling change detection because of its simplicity, reliability, and ease of use. Change tracking can easily identify which rows have been added, changed, or deleted since the last load. Using change tracking for incremental loads requires alteration to the source database, so this may not always be an option.
Using change data capture. Change data capture is another source-side method of detecting changes, available in SQL Server and other RDBMS systems. Although change data capture has far more functionality than simple change detection, it can be used to determine which data has been added, changed, or deleted.
Detecting changes in the source is the easiest way to handle delta detection, but it’s not perfect. Most of the methods for source-side change detection require that the source resides on a relational database, and in many cases, you’ll need some level of metadata control over that database. This may not be possible if the database is part of a vendor software package, or if it is external to your network entirely.
Destination Change Comparison
If the source for a given ETL process doesn’t support source change detection, you can fall back to a source-to-destination comparison to determine which data should be inserted or updated. This method of change detection requires a row-by-row analysis to differentiate unchanged data from that which has recently been created or modified. Because of this, you’ll not see the same level of performance as with source change detection. To make this work, all the data (or at least the entire period of data that you care about monitoring for changes) must be brought into the ETL pipeline for comparison.
Although it doesn’t perform as well as source-side change detection, using this comparison method has the fewest technical assumptions. It can be used with most any structured data source, including text files, XML, API result sets, and other non-relational structures.
If using a comparison, there are several different ways to approach this.
Using a row hash. With a row hash, you’ll have a single column that stores a sort of computed binary version of the columns that you want to use for comparison. The hashed binary value is a computed aggregate of all the columns you specify, a sort of unique sampling of the data. The row hash is usually more efficient than manually comparing each column. However, you do have to calculate the hash value, so there’s a bit of added complexity and overhead there, but it’s normally worth it for the performance gain.
Using an upsert operation, such as a merge statement. An upsert (update + insert) will handle both the new and changed data in a single step. It does this by matching the unique key column(s) and comparing the columns you want to check for changes. In fact, you can use the row hash method inside this merge operation to do the comparison. Most RDBMS systems have some implementation of upsert logic, including SQL Server, Oracle, Postgres, and MySQL. In most systems, the upsert operation will also allow you to deal with data that has been deleted in the source, whether you want to delete it in the target, soft delete it (leave it in place but mark it as deleted), or set an ending effective date.
Using a brute force comparison. This is as ugly as it sounds – going row by row and column by column to compare differences – but is also the method with the fewest requirements for change to the source system. A brute force comparison will work when no other change detection method will. Use this as a last resort if you can’t use any other method.
Using incremental loads to move data can shorten the run times of your ETL processes and reduce the risk when something goes wrong. Though this pattern isn’t right for every situation, the incremental load is flexible enough to consider for most any type of load.