Most traditional ETL processes perform their loads using three distinct and serial processes: extraction, followed by transformation, and finally a load to the destination. However, for some large or complex loads, using ETL staging tables can make for better performance and less complexity.
As part of my continuing series on ETL Best Practices, in this post I will some advice on the use of ETL staging tables.
Convention 3-Step ETL
Extract, transform, and load processes, as implied in that label, typically have the following workflow:
- Retrieve (extract) the data from its source, which can be a relational database, flat file, or cloud storage
- Reshape and cleanse (transform) data as needed to fit into the destination schema and to apply any cleansing or business rules
- Insert (load) the transformed data into the destination, which is usually (but not always) a relational database table
This typical workflow assumes that each ETL process handles the transformation inline, usually in memory and before data lands on the destination. For most ETL needs, this pattern works well. Every enterprise-class ETL tool is built with complex transformation tools, capable of handling many of these common cleansing, deduplication, and reshaping tasks. This three-step process of moving and manipulating data lends itself to simplicity, and all other things being equal, simpler is better.
However, there are cases where a simple extract, transform, and load design doesn’t fit well. Among these potential cases:
- Each row to be loaded requires something from one or more other rows in that same set of data (for example, determining order or grouping, or a running total)
- The source data is used to update (rather than insert into) the destination
- The ETL process is an incremental load, but the volume of data is significant enough that doing a row-by-row comparison in the transformation step does not perform well
- The data transformation needs require multiple steps, and the output of one transformation step becomes the input of another
Although it is usually possible to accomplish all of these things with a single, in-process transformation step, doing so may come at the cost of performance or unnecessary complexity. I’m an advocate for using the right tool for the job, and often, the best way to process a load is to let the destination database do some of the heavy lifting.
Using ETL Staging Tables
When the volume or granularity of the transformation process causes ETL processes to perform poorly, consider using a staging table on the destination database as a vehicle for processing interim data results. Staging tables are normally considered volatile tables, meaning that they are emptied and reloaded each time without persisting the results from one execution to the next. Staging tables should be used only for interim results and not for permanent storage.
When using a load design with staging tables, the ETL flow looks something more like this:
- Delete existing data in the staging table(s)
- Extract the data from the source
- Load this source data into the staging table(s)
- Perform relational updates (typically using T-SQL, PL/SQL, or other language specific to your RDBMS) to cleanse or apply business rules to the data, repeating this transformation stage as necessary
- Load the transformed data from the staging table(s) into the final destination table(s)
This load design pattern has more steps than the traditional ETL process, but it also brings additional flexibility as well. By loading the data first into staging tables, you’ll be able to use the database engine for things that it already does well. For example, joining two sets of data together for validation or lookup purposes can be done in most every ETL tool, but this is the type of task that the database engine does exceptionally well. Same thing with performing sort and aggregation operations; ETL tools can do these things, but in most cases, the database engine does them too, but much faster.
Options for Staging Tables
When using staging tables to triage data, you enable RDBMS behaviors that are likely unavailable in the conventional ETL transformation. For example, you can create indexes on staging tables to improve the performance of the subsequent load into the permanent tables. You can run multiple transformations on the same set of data without persisting it in memory for the duration of those transformations, which may reduce some of the performance impact.
Staging tables also allow you to interrogate those interim results easily with a simple SQL query. Further, you may be able to reuse some of the staged data, in cases where relatively static data is used multiple times in the same load or across several load processes.
Also, for some edge cases, I have used a pattern which has multiple layers of staging tables, and the first staging table is used to load a second staging table. This is a design pattern that I rarely use, but has come in useful on occasion where the shape or grain of the data had to be changed significantly during the load process.
Is This Still ETL?
Those who are pedantic about terminology (this group often includes me) will want to know: When using this staging pattern, is this process still called ETL? Typically, you’ll see this process referred to as ELT – extract, load, and transform – because the load to the destination is performed before the transformation takes place. I’ve seen lots of variations on this, including ELTL (extract, load, transform, load). However, I tend to use ETL as a broad label that defines the retrieval of data from some source, some measure of transformation along the way, followed by a load to the final destination. Semantically, I consider ELT and ELTL to be specific design patterns within the broad category of ETL.
Also, keep in mind that the use of staging tables should be evaluated on a per-process basis. The staging ETL architecture is one of several design patterns, and is not ideally suited for all load needs. Any mature ETL infrastructure will have a mix of conventional ETL, staged ETL, and other variations depending on the specifics of each load.
Tips for Using ETL Staging Tables
When you do decide to use staging tables in ETL processes, here are a few considerations to keep in mind:
Separate the ETL staging tables from the durable tables. There should be some logical, if not physical, separation between the durable tables and those used for ETL staging. You’ll get the most performance benefit if they exist on the same database instance, but keeping these staging tables in a separate schema – or perhaps even a separate database – will make clear the difference between staging tables and their durable counterparts. Separating them physically on different underlying files can also reduce disk I/O contention during loads.
Use permanent staging tables, not temp tables. The major relational database vendors allow you to create temporary tables that exist only for the duration of a connection. I typically recommend avoiding these, because querying the interim results in those tables (typically for debugging purposes) may not be possible outside the scope of the ETL process. Also, some ETL tools, including SQL Server Integration Services, may encounter errors when trying to perform metadata validation against tables that don’t yet exist.
Consider indexing your staging tables. Don’t arbitrarily add an index on every staging table, but do consider how you’re using that table in subsequent steps in the ETL load. For some use cases, a well-placed index will speed things up.
Consider emptying the staging table before and after the load. You’ll want to remove data from the last load at the beginning of the ETL process execution, for sure, but consider emptying it afterward as well. Especially when dealing with large sets of data, emptying the staging table will reduce the time and amount of storage space required to back up the database.
Do you need to run several concurrent loads at once? For most loads, this will not be a concern. However, some loads may be run purposefully to overlap – that is, two instances of the same ETL processes may be running at any given time – and in those cases you’ll need more careful design of the staging tables. Typically, staging tables are just truncated to remove prior results, but if the staging tables can contain data from multiple overlapping feeds, you’ll need to add a field identifying that specific load to avoid parallelism conflicts.
Handle data lineage properly. If your ETL processes are built to track data lineage, be sure that your ETL staging tables are configured to support this. Data lineage provides a chain of evidence from source to ultimate destination, typically at the row level. If you track data lineage, you may need to add a column or two to your staging table to properly track this.
While the conventional three-step ETL process serves many data load needs very well, there are cases when using ETL staging tables can improve performance and reduce complexity. A good design pattern for a staged ETL load is an essential part of a properly equipped ETL toolbox.