In SQL Server Integration Services, the lookup component is one of the most frequently used tools for data validation and completion. The lookup component is provided as a means to virtually join one set of data to another to validate and/or retrieve missing values. Properly configured, it is reliable and reasonably fast. To get the most out of this component, one has to be aware of the different SSIS lookup cache settings and how each behaves.
The SSIS lookup cache selection will determine whether and how the distinct lookup values are cached during package execution. It is critical to know how cache modes affect the result of the lookup and the performance of the package, as choosing the wrong setting can lead to poorly performing packages, and in some cases, incorrect results.
In this brief SSIS Basics post, I’ll discuss the three different SSIS lookup cache modes and the behaviors of each.
SSIS Lookup Cache Modes
The SSIS lookup transformation uses a setting called Cache Mode to determine how its data is cached at runtime. Those three modes are full cache, partial cache, and no cache. As shown below, the first tab of the lookup transformation lets you select which cache mode to use.
The full cache mode setting is the default cache mode selection in the SSIS lookup transformation. Like the name implies, full cache mode will cause the lookup transformation to retrieve and store in SSIS cache the entire set of data from the specified lookup location. As a result, the data flow in which the lookup transformation resides will not start processing any data buffers until all of the rows from the lookup query have been cached in SSIS.
The most commonly used cache mode is the full cache setting, and for good reason. The full cache setting has the most practical applications, and should be considered the go-to cache setting when dealing with an untested set of data. With a moderately sized set of reference data, a lookup transformation using full cache mode usually performs well. Full cache mode does not require multiple round trips to the database, since the entire reference result set is cached prior to data flow execution.
There are a few potential gotchas to be aware of when using full cache mode. First, you can see some performance issues – memory pressure in particular – when using full cache mode against large sets of reference data. If the table you use for the lookup is very large (either deep or wide, or perhaps both), there’s going to be a performance cost associated with retrieving and caching all of that data. Also, keep in mind that when doing a lookup on character data, full cache mode will always do a case-sensitive (and in most cases, space-sensitive) string comparison even if your database is set to a case-insensitive collation. This is because the in-memory lookup uses a .NET string comparison (which is case- and space-sensitive) as opposed to a database string comparison (which may be case sensitive, depending on collation). There’s a relatively easy workaround in which you can use the UPPER() or LOWER() function in the pipeline data and the reference data to ensure that case differences do not impact the success of your lookup operation. Again, neither of these present a reason to avoid full cache mode, but should be used to determine whether full cache mode should be used in a given situation.
Full cache mode is ideally useful when one or all of the following conditions exist:
- The size of the reference data set is small to moderately sized
- The size of the pipeline data set (the data you are loading from some source to a destination) is large, is unknown at design time, or is unpredictable
- Each distinct key value(s) in the pipeline data set is expected to be found multiple times in that set of data
When using the partial cache setting for the SSIS lookup cache mode, lookup values will still be cached, but only as each distinct value is encountered in the data flow. Initially, each distinct value will be retrieved individually from the specified source, and then cached. To be clear, this is a row-by-row lookup for each distinct key value(s).
This is a less frequently used cache setting because it addresses a narrower set of scenarios. Because each distinct key value(s) combination requires a relational round trip to the lookup source, performance can be an issue, especially with a large pipeline data set to be compared to the lookup data set. If you have, for example, a million records from your pipeline data source, you have the potential for doing a million lookup queries against your lookup data source (depending on the number of distinct values in the key column(s)). Therefore, one has to be keenly aware of the expected row count and value distribution of the pipeline data to safely use partial cache mode.
Using partial cache mode is ideally suited for the conditions below:
- The size of the data in the pipeline (more specifically, the number of distinct key column) is relatively small
- The size of the lookup data is too large to effectively store in cache
- The lookup source is well indexed to allow for fast retrieval of row-by-row values
As you might guess, selecting no cache mode will not add any values to the lookup cache in SSIS. As a result, every single row in the pipeline data set will require a query against the lookup source. Since no data is cached, it is possible to save a small amount of overhead in SSIS memory in cases where key values are not reused. In the real world, I don’t encounter many cases that require the use of the no cache setting, but for a few edge cases it can be handy.
As such, it’s critical to know your data before choosing this option. Obviously, performance will be an issue with anything other than small sets of data, as the no cache setting requires row-by-row processing of all of the data in the pipeline.
I would recommend considering the no cache mode only when all of the below conditions are true:
- The reference data set is too large to reasonably be loaded into SSIS memory
- The pipeline data set is small and is not expected to grow
- There are expected to be very few or no duplicates of the key values(s) in the pipeline data set (i.e., there would be no benefit from caching these values)
The cache mode, an often-overlooked setting on the SSIS lookup component, represents an important design decision in your SSIS data flow. Choosing the right lookup cache mode directly impacts the fidelity of your results and the performance of package execution. Know how this selection impacts your ETL loads, and you’ll end up with more reliable, faster packages.
Now you got me wondering if using a cache connection manager might speed things up when doing lookups of large amounts of data in a very large lookup table. (like comparing a record to see if it already exists in the archive table for example) And wondering if the lookup is any faster than say a merge join in the dataflow (I think it is) or doing a table to table SQL task left join instead.