This is the first in a series of technical posts on using parent-child architectures in SQL Server Integration Services. The index page for all posts can be found here.
In this post, I will provide an overview of the architecture and describe the benefits of implementing a parent-child design pattern in SSIS structures.
The simplest definition of SSIS parent-child architecture is that it consists of packages executing other packages. In SSIS, the package is the base executable; it is the most granular component that can be executed independently1. Every version of SSIS includes the ability for one package to execute another package through the use of one of the following:
- The Execute Package Task
- T-SQL commands
- The Execute Process Task (using the dtexec.exe utility)
Though the reference to parent-child architecture implies that there are exactly two layers to this design, that does not have to be the case. You may have a design where a package executes a package which executes a package, and so forth. Although there may be a hard limit to how deeply nested a parent-child architecture may go, I have never encountered such a limitation. I have found it useful on a few occasions to go deeper than two levels in this type of architecture, particularly when designing a formal ETL framework (to be discussed further in a future post in this series). In cases where greater than two levels exist, finding the right terminology for those layers is important. You can refer to them by patriarchy (grandparent/parent/child) or by cardinality (level 1, level 2, level n), as long as you remain consistent – especially in your documentation – with those references.
Conceptually, a parent-child architecture is a form of code abstraction. By encapsulating ETL actions into discrete units of work (packages), we’re creating a network of moving parts that can be developed, tested, and executed independently or as part of a larger collection.
As I mentioned in my introductory post, there are several benefits to using parent-child structures in SSIS.
Reusability. In any ETL environment of significant size or complexity, it’s quite normal to discover common ETL behaviors that are reusable across different implementations. For a concrete example of this: In my spare time, I’m working on an ETL application that downloads XML files from a Major League Baseball web service. There are files of various formats, and each file format is processed a different way, but with respect to the download of the files, I always perform the same set of operations: create a log entry for the file; attempt to download the file to the local server; log the result (success or failure) of the download operation; if the download has failed, set the HasErrors variable on the main package. If I were to load this behavior into a group of tasks in the package for each XML format, I’d have five different copies of the same logic. However, by building a parameterized child package that performs all of these core functions, I only have to build the file download/logging logic once, and execute the resulting package with the appropriate parameters each time I need to download a file.
Easier development. Working with large and complex SSIS packages can be a pain. The larger the SSIS packages, the longer it takes for the BIDS or SSDT environment to do its validation checks when the package is opened or modified. Further, when multiple ETL developers are working on the same project, it is much easier to break apart the project into discrete units of work when using numerous smaller SSIS packages.
Easier testing and debugging. When working through the test and debug cycles during and after initial development, it’s almost always easier to test and debug smaller packages. To test a single task that resides in a large SSIS package would require either running the task by itself manually in the Visual Studio designer, or disabling all of the other tasks and redeploying the package. When working with packages that each perform one unit of work, one can often simply execute the package to be tested through the normal scheduling/execution mechanism.
Clarity of purpose. An architecture that uses small, single-operation packages lends itself to clarity of purpose by virtue of naming. When browsing a list of deployed packages, it is much more clear to see package names such as “Load Customers Table”, “Merge Product Table”, and “Remove Duplicates in Vehicle Table” than to find do-everything packages with names like “Load Production DB”, “Update DW”, etc.
Performance. In some cases, breaking out multi-step SSIS package can bring some performance gains. One distinct case that comes to mind is using a distributed architecture, where packages within a single execution group are executed on multiple servers. By distributing packages across different SQL Server machines (either physical or virtual), it may be possible to improve performance in cases where the processing load on a single SSIS server has become a bottleneck. I want to emphasize that using a parent-child architecture does not arbitrarily improve performance, so this should not be used as a silver bullet to improve a poorly performing group of packages.
As I mentioned earlier, there are three tools that can be used to execute a package from within another package.
The execute package task. This is the easiest and most common means of executing a package from within another. This task can trigger the execution of a package stored on the file system, deployed to MSDB or the SSIS catalog, or residing in the same project. If using SSIS 2012 with catalog deployment mode, you can also use the execute package task to pass parameter values from the parent package to the child package. It is important to note that the execute package task behaves differently in SSIS 2012 than it does in older versions.
T-SQL commands (via the execute SQL task). For SSIS projects using project deployment model in SSIS 2012, the built-in stored procedures in the SSIS catalog can be used to execute packages. This method for executing packages, like the execute package task, allows you to specify runtime parameters via T-SQL code. One significant advantage of using T-SQL commands to execute packages is that, unlike the execute package task, you can use expressions to set at runtime the name of the package to be executed. This is useful in cases where you are iterating over a list of packages that may not be known at runtime, such as a pattern found in ETL frameworks.
dtexec.exe (via the execute process task). Using this method allows you to trigger package execution via the command-line application dtexec.exe. Although this method is typically used to execute packages in a standalone environment – for example, when using third-party scheduling tools to orchestrate package execution – but dtexec can also be used within SSIS by way of the execute process task. As an aside, I rarely use dtexec to execute child packages – in most cases, it’s easier to use either the execute package task or T-SQL commands to execute one package from within another.
I’ll also briefly mention dtexecui.exe. This is a graphical tool that serves the same purpose as dtexec.exe, except that the former exposes functionality via a graphical user interface rather than forcing the user to use command-line parameters for configuration. Except for this brief mention, I’ll not cover dtexecui.exe in this discussion of parent-child architecture, as that tool is intended for interactive (manual) execution of packages and is not a suitable tool for executing one package from within another.
Parent-Child architecture in the real world
To illustrate how this can work, let’s model out a realistic example. Imagine that we have charge over the development of a sizeable healthcare database. In addition to our production data, we’ve got multiple environments – test, development, and training – to support the development life cycle and education needs. As is typical for these types of environments, these databases need to be refreshed from the production database from time to time.
The refresh processes for each of these environments will look similar to the others. In each of them, we will extract any necessary data for that environment, retrieve and restore the backup from production, and import the previously extracted data back into that environment. Since we are dealing with sensitive healthcare data, the information in the training database needs to be sufficiently anonymized to avoid an inappropriate disclosure of data. In addition, our test database needs to be loaded with some test cases to facilitate testing for potential vulnerabilities. Even though there are some differences in the way each environment is refreshed, there are several points of shared – and duplicate – behavior, as shown below (with the duplicates in blue).
Instead of using duplicate static elements, we can eliminate some code redundancy and maintenance overhead by encapsulating those shared behavior into their own container – specifically, a parameterized package. In doing so, we can avoid having multiple points of administration when (not if) we need to make adjustments to those common elements of the refresh process. The updated architecture uses parameters (or package configurations, if using package deployment mode in SSIS 2012 or any older version of SSIS) to pass in the name of the database environment to refresh.
As shown, we’ve moved those shared behaviors into a separate package (RefreshDB), the behavior of which is driven by the parameters passed into it. The duplicate code is gone. We now have just one SSIS package, instead of three, that need to be altered when those common behaviors change. Further, we can individually test and debug the child package containing those common behaviors, without the additional environment-specific operations.
Note that we haven’t reduced the number of packages using this architecture. The goal isn’t fewer packages. We’re aiming for a modularized, easy-to-maintain design, which typically results in a larger number of packages that each perform just a few (and sometimes just one) functions. In fact, in the parent-child architecture shown above, we could even further refine this pattern by breaking out the individual operations in the RefreshDB package into packages of their own, which would be practical for cases in which those tasks might be executed apart from the others.
Exceptions to the rule
Are there cases in which parent-child structures do not add value? Certainly. A prime example of such a case is a small, simple package developed for a single execution with no expectation that its logic will be reused. I call these throwaway packages. Because of their single-use nature, there is likely little value in going through the effort to building a parent-child architecture around their business logic.
In my next post in this series, I’ll work through the mechanics of using a parent-child pattern in SSIS 2005 or SSIS 2008.
1 Technically, there are lower-level items in the SSIS infrastructure that can be executed independently. For example, from the BIDS or SSDT design surface, one can manually execute a single task or container within a package. However, when deploying or scheduling the execution of some ETL behavior, the package is the lowest level of granularity that can be addressed.
Edit: Corrected typo on one of the graphics.