Most traditional ETL processes perform their loads using three distinct and serial processes: extraction, followed by transformation, and finally a load to the destination. However, for some large or complex loads, using ETL staging tables can make for better performance and less complexity.
As part of my continuing series on ETL Best Practices, in this post I will share some advice on the use of ETL staging tables.
Conventional 3-Step ETL
Extract, transform, and load processes, as implied in that label, typically have the following workflow:
- Retrieve (extract) the data from its source, which can be a relational database, flat file, or cloud storage
- Reshape and cleanse (transform) data as needed to fit into the destination schema and to apply any cleansing or business rules
- Insert (load) the transformed data into the destination, which is usually (but not always) a relational database table
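The three steps above can be sketched in miniature. This is a hedged illustration, not a production pattern: it uses Python's `sqlite3` module as a stand-in for the destination database, and the source data, table, and column names are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source: a small CSV extract of customer records.
SOURCE_CSV = "id,name,email\n1, Ada ,ADA@EXAMPLE.COM\n2,Grace,grace@example.com\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")

# Extract: retrieve the rows from the source.
rows = list(csv.DictReader(io.StringIO(SOURCE_CSV)))

# Transform: cleanse in memory, before anything lands on the destination.
cleansed = [(int(r["id"]), r["name"].strip(), r["email"].strip().lower()) for r in rows]

# Load: insert the transformed rows into the destination table.
conn.executemany("INSERT INTO customers (id, name, email) VALUES (?, ?, ?)", cleansed)
conn.commit()

print(conn.execute("SELECT name, email FROM customers ORDER BY id").fetchall())
# → [('Ada', 'ada@example.com'), ('Grace', 'grace@example.com')]
```

Note that the transformation happens entirely in the ETL process's own memory; the destination only ever sees finished rows.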
This typical workflow assumes that each ETL process handles the transformation inline, usually in memory and before data lands on the destination. For most ETL needs, this pattern works well. Every enterprise-class ETL tool is built with complex transformation tools, capable of handling many of these common cleansing, deduplication, and reshaping tasks. This three-step process of moving and manipulating data lends itself to simplicity, and all other things being equal, simpler is better.
However, there are cases where a simple extract, transform, and load design doesn’t fit well. Among these potential cases:
- Each row to be loaded requires something from one or more other rows in that same set of data (for example, determining order or grouping, or a running total)
- The source data is used to update (rather than insert into) the destination
- The ETL process is an incremental load, but the volume of data is significant enough that doing a row-by-row comparison in the transformation step does not perform well
- The data transformation needs require multiple steps, and the output of one transformation step becomes the input of another
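The first case in the list above (a row needing something from other rows in the same set, such as a running total) is a good concrete example of where row-at-a-time transformation struggles but a set-based SQL operation is trivial. This sketch assumes SQLite 3.25+ for window function support; the table and data are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", 100.0), (2, "acme", 50.0), (3, "globex", 75.0), (4, "acme", 25.0)],
)

# A per-customer running total requires visibility across rows in the same set:
# trivial as a window function, awkward as an in-flight row-by-row transform.
rows = conn.execute(
    """
    SELECT order_id, customer,
           SUM(amount) OVER (PARTITION BY customer ORDER BY order_id) AS running_total
    FROM orders
    ORDER BY order_id
    """
).fetchall()
print(rows)
# → [(1, 'acme', 100.0), (2, 'acme', 150.0), (3, 'globex', 75.0), (4, 'acme', 175.0)]
```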
Although it is usually possible to accomplish all of these things with a single, in-process transformation step, doing so may come at the cost of performance or unnecessary complexity. I’m an advocate for using the right tool for the job, and often, the best way to process a load is to let the destination database do some of the heavy lifting.
Using ETL Staging Tables
When the volume or granularity of the transformation process causes ETL processes to perform poorly, consider using a staging table on the destination database as a vehicle for processing interim data results. Staging tables are normally considered volatile tables, meaning that they are emptied and reloaded each time without persisting the results from one execution to the next. Staging tables should be used only for interim results and not for permanent storage.
When using a load design with staging tables, the ETL flow looks something more like this:
- Delete existing data in the staging table(s)
- Extract the data from the source
- Load this source data into the staging table(s)
- Perform relational updates (typically using T-SQL, PL/SQL, or other language specific to your RDBMS) to cleanse or apply business rules to the data, repeating this transformation stage as necessary
- Load the transformed data from the staging table(s) into the final destination table(s)
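The five steps above can be condensed into a minimal sketch. As before, `sqlite3` stands in for the destination RDBMS, and the `stg_customers` / `dim_customers` tables and their columns are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_customers (id TEXT, name TEXT, email TEXT);  -- volatile staging
    CREATE TABLE dim_customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
""")

# 1. Delete existing data in the staging table.
conn.execute("DELETE FROM stg_customers")

# 2-3. Extract from the source and load it into staging as-is, raw types and all.
source_rows = [("1", " Ada ", "ADA@EXAMPLE.COM"), ("2", "Grace", "grace@example.com")]
conn.executemany("INSERT INTO stg_customers VALUES (?, ?, ?)", source_rows)

# 4. Transform in place with set-based SQL, letting the engine do the heavy lifting.
conn.execute("UPDATE stg_customers SET name = TRIM(name), email = LOWER(TRIM(email))")

# 5. Load the cleansed rows from staging into the final destination table.
conn.execute("""
    INSERT INTO dim_customers (id, name, email)
    SELECT CAST(id AS INTEGER), name, email FROM stg_customers
""")
conn.commit()
```

The transformation in step 4 is a single set-based `UPDATE` rather than a row-by-row operation in the ETL tool, which is where this pattern earns its keep on large volumes.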
This load design pattern has more steps than the traditional ETL process, but it also brings additional flexibility. By loading the data first into staging tables, you'll be able to use the database engine for things it already does well. For example, joining two sets of data together for validation or lookup purposes can be done in most every ETL tool, but this is the type of task the database engine does exceptionally well. The same goes for sort and aggregation operations: ETL tools can do these things, but in most cases the database engine does them much faster.
Options for Staging Tables
When using staging tables to triage data, you enable RDBMS behaviors that are likely unavailable in the conventional ETL transformation. For example, you can create indexes on staging tables to improve the performance of the subsequent load into the permanent tables. You can run multiple transformations on the same set of data without persisting it in memory for the duration of those transformations, which may reduce some of the performance impact.
Staging tables also allow you to interrogate those interim results easily with a simple SQL query. Further, you may be able to reuse some of the staged data, in cases where relatively static data is used multiple times in the same load or across several load processes.
Also, for some edge cases, I have used a pattern with multiple layers of staging tables, in which the first staging table is used to load a second staging table. This is a design pattern I rarely use, but it has proven useful on occasions where the shape or grain of the data had to be changed significantly during the load process.
Is This Still ETL?
Those who are pedantic about terminology (this group often includes me) will want to know: When using this staging pattern, is this process still called ETL? Typically, you’ll see this process referred to as ELT – extract, load, and transform – because the load to the destination is performed before the transformation takes place. I’ve seen lots of variations on this, including ELTL (extract, load, transform, load). However, I tend to use ETL as a broad label that defines the retrieval of data from some source, some measure of transformation along the way, followed by a load to the final destination. Semantically, I consider ELT and ELTL to be specific design patterns within the broad category of ETL.
Also, keep in mind that the use of staging tables should be evaluated on a per-process basis. The staging ETL architecture is one of several design patterns, and is not ideally suited for all load needs. Any mature ETL infrastructure will have a mix of conventional ETL, staged ETL, and other variations depending on the specifics of each load.
Tips for Using ETL Staging Tables
When you do decide to use staging tables in ETL processes, here are a few considerations to keep in mind:
Separate the ETL staging tables from the durable tables. There should be some logical, if not physical, separation between the durable tables and those used for ETL staging. You’ll get the most performance benefit if they exist on the same database instance, but keeping these staging tables in a separate schema – or perhaps even a separate database – will make clear the difference between staging tables and their durable counterparts. Separating them physically on different underlying files can also reduce disk I/O contention during loads.
Use permanent staging tables, not temp tables. The major relational database vendors allow you to create temporary tables that exist only for the duration of a connection. I typically recommend avoiding these, because querying the interim results in those tables (typically for debugging purposes) may not be possible outside the scope of the ETL process. Also, some ETL tools, including SQL Server Integration Services, may encounter errors when trying to perform metadata validation against tables that don’t yet exist.
Consider indexing your staging tables. Don’t arbitrarily add an index on every staging table, but do consider how you’re using that table in subsequent steps in the ETL load. For some use cases, a well-placed index will speed things up.
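As a hedged illustration of this tip, the sketch below creates the index after the bulk load (so the inserts themselves aren't slowed by index maintenance) and then checks the query plan. The table, column names, and data are hypothetical, with `sqlite3` again standing in for the destination engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_orders (order_id INTEGER, customer_key TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO stg_orders VALUES (?, ?, ?)",
    [(i, f"C{i % 100:03d}", float(i)) for i in range(1000)],
)

# Index the column the subsequent step will join or filter on, added *after*
# the bulk load so the inserts don't pay the index-maintenance cost.
conn.execute("CREATE INDEX ix_stg_orders_customer ON stg_orders (customer_key)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT SUM(amount) FROM stg_orders WHERE customer_key = 'C042'"
).fetchall()
print(plan)  # SQLite reports an index search rather than a full table scan
```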
Consider emptying the staging table before and after the load. You’ll want to remove data from the last load at the beginning of the ETL process execution, for sure, but consider emptying it afterward as well. Especially when dealing with large sets of data, emptying the staging table will reduce the time and amount of storage space required to back up the database.
Do you need to run several concurrent loads at once? For most loads, this will not be a concern. However, some loads may purposely be run to overlap – that is, two instances of the same ETL process may be running at any given time – and in those cases you’ll need more careful design of the staging tables. Typically, staging tables are just truncated to remove prior results, but if the staging tables can contain data from multiple overlapping feeds, you’ll need to add a field identifying that specific load to avoid parallelism conflicts.
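One way to sketch that load-identifier approach: each execution tags its rows with a unique load ID instead of truncating, so downstream steps filter on that ID and overlapping runs never touch each other's interim data. The table and helper names here are illustrative, not a prescribed design.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_events (load_id TEXT, payload TEXT)")

def stage_rows(conn, rows):
    # Tag this execution's rows with a unique load ID rather than truncating,
    # so concurrent runs can share the same staging table safely.
    load_id = str(uuid.uuid4())
    conn.executemany(
        "INSERT INTO stg_events (load_id, payload) VALUES (?, ?)",
        [(load_id, r) for r in rows],
    )
    return load_id

# Two overlapping loads staging into the same table:
load_a = stage_rows(conn, ["a1", "a2"])
load_b = stage_rows(conn, ["b1"])

# Downstream transform/load steps filter on their own load ID only.
mine = conn.execute(
    "SELECT payload FROM stg_events WHERE load_id = ?", (load_a,)
).fetchall()
print(mine)
# → [('a1',), ('a2',)]
```

When the load finishes, it deletes only its own rows (`DELETE ... WHERE load_id = ?`) rather than truncating the whole table.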
Handle data lineage properly. If your ETL processes are built to track data lineage, be sure that your ETL staging tables are configured to support this. Data lineage provides a chain of evidence from source to ultimate destination, typically at the row level. If you track data lineage, you may need to add a column or two to your staging table to properly track this.
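In practice, "a column or two" often means recording the source system, the extraction timestamp, and the ID of the ETL run that moved the row. The column names below are hypothetical examples of such lineage columns, not a standard.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
# Illustrative lineage columns: where each row came from, when it was
# extracted, and which ETL run moved it.
conn.execute("""
    CREATE TABLE stg_products (
        sku TEXT, price REAL,
        src_system TEXT, src_extracted_at TEXT, etl_run_id INTEGER
    )
""")

run_id = 42  # in practice, a sequence-generated ID from an ETL audit table
extracted_at = datetime.now(timezone.utc).isoformat()
conn.executemany(
    "INSERT INTO stg_products VALUES (?, ?, ?, ?, ?)",
    [
        ("A-1", 9.99, "erp", extracted_at, run_id),
        ("B-2", 4.50, "erp", extracted_at, run_id),
    ],
)
conn.commit()
```

Carrying these columns through from staging to the destination preserves the row-level chain of evidence described above.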
While the conventional three-step ETL process serves many data load needs very well, there are cases when using ETL staging tables can improve performance and reduce complexity. A good design pattern for a staged ETL load is an essential part of a properly equipped ETL toolbox.
Tim, I’ve heard some recently refer to this as a “persistent staging area”. I have used and seen various terms for this in different shops, such as landing area, data landing zone, and data landing pad. It is in fact a method that both IBM and Teradata have promoted for many years. I’ve followed this practice in every data warehouse I’ve been involved in for well over a decade and wouldn’t do it any other way. I learned by experience that not doing it this way can be very costly in a variety of ways.
I would also add that if you’re building an enterprise solution, you should include a “touch-and-take” method of not excluding columns of any structure/table that you are staging, as well as getting all business-valuable structures from a source rather than only what requirements ask for (within reason). Only with that approach will you provide a more agile ability to meet changing needs over time, as you will already have the data available. Remember also that source systems pretty much always overwrite and often purge historical data. So this persistent staging area can and often does become the only source for historical source system data for the enterprise.
Hi Gary, I’ve seen the persistent staging pattern as well, and there are some things I like about it. Retaining an accurate historical record of the data is essential for any data load process, and if the original source data cannot be used for that, having a permanent storage area for the original data (whether it’s referred to as persisted stage, ODS, or other term) can satisfy that need.
Thanks for the article. I’m glad you expanded on your comment “consider using a staging table on the destination database as a vehicle for processing interim data results” to clarify that you may want to consider at least a separate schema if not a separate database. I would strongly advocate a separate database. The nature of the tables would allow that database not to be backed up, but simply scripted. It also reduces the size of the database holding the data warehouse relational tables. I’ve run into times where the backup is too large to move around easily even though a lot of the data is not necessary to support the data warehouse. As a fairly concrete rule, a table is only in that database if needed to support the SSAS solution. I’ve occasionally had to make exceptions and store data that needs to persist to support the ETL as I don’t backup the staging databases.
I’d be interested to hear more about your lineage columns. Each of my ETL processes has a sequence-generated ID, so no two have the same number. That number doesn’t get added until the first persistent table is reached. That ETL ID points to the information for that process, including time and record counts for the fact and dimension tables. I can’t see what else might be needed.
@Gary, regarding your “touch-and-take” approach. With few exceptions, I pull only what’s necessary to meet the requirements. I worked at a shop with that approach, and the download took all night. At my next place, I have found by trial and error that adding columns has a significant impact on download speeds. I was able to make significant improvements to the download speeds by extracting (with occasional exceptions) only what was needed. I grant that when a new item is needed, it can be added faster. But there’s a significant cost to that.
I wanted to get some best practices on extract file sizes. Currently, I am working as the Data Architect to build a Data Mart. I am working on the staging tables that will encapsulate the data being transmitted from the source environment. I would like to know what the best practices are on the number of files and file sizes. Right now I believe I have about 20+ files with at least 30+ more to come. In some cases a file just contains address information or just phone numbers. Would combining these sets assist an ETL tool in better performing the transformations? I have worked in data warehousing before but have not dictated how the data can be received from the source. If you could shed some light on how the source could best send the files to assist an ETL process in functioning efficiently, accurately, and effectively, that would be great.
Thank you for your time.
I am very glad to see I am not the only one out here using staging tables… 🙂 THANK you for your very informative site and post. Now, I wonder if you have any words on raw files? I have one of my ETL solutions where I used raw files instead of persisted temp table in the database. Raw files are supposed to be faster and arguably designed precisely for the use case of temporarily staging your data in between Extract and Load operations. So, what’s your opinion on using raw files vs. temp tables? Advantages or disadvantages of each approach? Your opinion and knowledge on the subject is appreciated in advance. Best, Raphael
Raphael, that’s an excellent question. I wrote a bit about using raw files here: https://www.timmitchell.net/post/2019/09/03/using-raw-files-in-ssis/
Using raw files in SSIS is similar to using staging tables or temp tables in the database engine. Both are designed for storing data for interim processing. I don’t hold either of these as being better than the other; each is a different way to solve the same problem. Using SSIS raw files leans more on the Integration Services engine to handle this interim processing, while moving it to the database engine in temp tables or staging tables offloads that processing to the database engine (in more of an ELT pattern rather than ETL).