SSIS Parent-Child Architecture in Package Deployment Mode

This is the second in a series of posts about SSIS parent-child architecture.  You can find the index page here.

In my first post in this series, I covered the essentials of SSIS parent-child design patterns, including a number of advantages of using such patterns.  In this post, I’m going to demonstrate how to build parent-child structures using package deployment mode.  This deployment mode is available (though not the default behavior) in SSIS 2012 and 2014, and is the only deployment mode available in SSIS 2008 and earlier.  You might wonder why I would cover this deployment mode, since it’s no longer the preferred method for SSIS 2012 and 2014, and older versions that require package deployment mode are at least six years and two versions old.  I share this information for two main reasons.  First of all, even though package deployment mode is less commonly used than project deployment mode, there are many organizations still using it – even in recent versions of SSIS – due to legacy packages that were imported from older versions.  Additionally, it seems everywhere I go, folks are still running mission-critical packages on older versions of SSIS (especially 2008), and when I demonstrate features in SSIS 2012 or 2014, I’m frequently asked about the equivalent behavior in those older versions.

Under the package deployment model, there are generally three ways to execute one package from another package:

  • Execute Package Task – This is the built-in tool in SSIS that allows the developer to execute a package from another package
  • dtexec.exe – This is the command-line tool for executing packages
  • dtexecui.exe – This is the user-friendly version of dtexec, with a full UI for executing packages

Each of these three methods can be used to execute packages stored either on the file system or in MSDB.  Because this post addresses parent-child structures, which typically involve packages executing other packages, I’ll focus on the execute package task.
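For comparison with the execute package task, here is a minimal Python sketch of how a dtexec command line might be assembled for the two storage locations mentioned above.  The function name, package paths, and server name are hypothetical; only the /FILE, /SQL, and /SERVER options come from dtexec itself.  The sketch builds the argument list and does not run anything.

```python
from typing import List, Optional


def build_dtexec_args(package: str, server: Optional[str] = None) -> List[str]:
    """Return an argument list for running one package with dtexec.

    If `server` is given, the package is assumed to be stored in MSDB on that
    instance; otherwise `package` is treated as a .dtsx path on the file system.
    """
    if server is not None:
        # /SQL names the package stored in MSDB; /SERVER names the instance.
        return ["dtexec", "/SQL", package, "/SERVER", server]
    # /FILE points at a package file on disk.
    return ["dtexec", "/FILE", package]


# Hypothetical package locations, for illustration only.
file_args = build_dtexec_args(r"C:\SSIS\ChildPackage.dtsx")
msdb_args = build_dtexec_args(r"\ETL\ChildPackage", server="MyServer")

print(file_args)
print(msdb_args)
```

An argument list like this could be handed to an execute process task or a scheduler, though in most parent-child designs the execute package task remains the simpler choice.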

Execute Package Task

Using the execute package task in package deployment mode is a relatively straightforward process; simply drag the execute package task onto the control flow surface and set the configuration options.  As shown in the screenshot below, executing a package stored in MSDB requires only a connection to the SQL Server instance and the full path of the package to be executed.  Optionally, you can also specify the password (if the child package to be executed is password protected) and choose to execute the package in a separate process.

image

Similar options are available for a package deployed to the file system.  Note that you must create a file connection to connect to the package stored in the file system as shown below.

image

Under the Expressions tab (shown below), you can optionally configure dynamic values using SSIS expression language.  For example, you could add some logic that would dynamically specify the name of the package, which can be useful when building a work pile pattern (which I will cover in a future post).

image

Passing Values

So far, all straightforward, with very little to configure, right?  For a simple parent-child setup, this may be all you need to do.  But what if you want to make this arrangement more flexible – for example, by passing some values from the parent package to the child package?  A quick review of the execute package task reveals exactly zero hints on how to accomplish this.  By design, the parent package does not explicitly pass values down to child packages; rather, the child package declares – through package configurations – which variables it will consume from the parent.  So, for sharing variables from parent to child, the task list will be:

  • In the parent package, create and populate the variables to be shared with the child package
  • In the child package, create variables that will be populated by values from the parent
  • In the child package, create package configurations – one per variable – to populate the child variables from those in the parent

Creating the variables is very simple.  Start with the parent package, creating a couple of variables – named vIntVar and vStringVar – and setting their values, as shown below.

image

Next, in the child package (shown below), create two corresponding variables which will be used to store the variable values from the parent package.  Note that in this example I have not assigned a default value to the String variable, since the parent package will be supplying that value at runtime.  For this example, you’d still have to supply a default value for the Integer (Int32) variable, because this data type requires a non-NULL default value.

image

That’s the easy part.  The more obscure and difficult part is adding in the configurations for these variables.  When working in package deployment mode in SSIS 2014 or 2012 (or on any package on SSIS 2008 or 2005), package configurations are used to allow child packages to consume variable values from the parent package.  The package configurations, defined on the child packages, specify which values are to be received from the parent package, and which variables those values should be written into.  To create the package configuration in the child package, right click on an empty space in the control flow and select Package Configurations.  You’ll need to click the checkbox to Enable package configurations, if you haven’t previously used configurations in this package.  The empty Package Configurations Organizer window is shown below.

image

To create a new package configuration, click the Add… button to open the Package Configuration Wizard window.  On the Configuration type setting, choose Parent package variable, and type in the name of the first variable to configure (vIntVar) as shown below.

image

Wait – I have to type in that variable name?  Why not have a drop-down list to let me choose it?  That’s an excellent question, and one that I too asked when I first started with package configurations.  The reason you have to type in the variable name is that the child package has no explicit relationship to the parent package.  Theoretically, this child package could be executed from any parent package – it’s not bound to any particular parent.

Once you’ve typed in the name of the parent package variable, click Next > and go to the next configuration page.  Here, you specify where to put this value from the parent package variable.  As shown below, you will use the package tree view to specify the variable to which you’re going to write this value.  Drill into the variable name, expand the Properties beneath that variable, and click on Value to select it as the target of this configuration assignment.

Just as a side note, if you drill into the folders below the list of variables, you can see that you don’t necessarily have to use an SSIS variable in the child package to capture the value from the parent package; you can write directly to other properties of the package (such as connection strings).  For a variety of reasons – most notably, easier troubleshooting – I recommend that you always write a value from a parent package configuration into an SSIS variable in your child package.  You can always assign that child package variable value to a built-in value – for example, a connection string – using an expression elsewhere in the package.

image

After selecting Variables –> vIntVar –> Properties –> Value, click Next > and supply a name for this configuration as shown below.  To keep things simple for this example, I’ve simply named the configuration with the same name as the variable (though you can give it a different name, if you prefer).

image

For the other variable (vStringVar), you can repeat this process, mapping the variable of that name in the parent to the variable of the same name in the child package.

With both configurations in place, the package is ready to be run for testing.  On the child package, I’ve created a script task that will open a Windows message box to show the variable values to confirm that they are coming from the parent package.  As shown below, the message box window from the script task in the child package confirms that the child package variable values are supplied by the parent package.

image

As I discussed in my previous post, you now have a simple package infrastructure that lends itself to easier development and troubleshooting, streamlined error handling, and less code repetition.  It’s difficult to see with just two packages, but with an ETL system with dozens or hundreds of packages, the advantages of these parent-child patterns quickly become obvious.

Parting Thoughts

A few things to keep in mind when using the execute package task in package deployment mode:

  • The parent package variable and child package variables do not have to have the same names.  I kept the same names in this example for clarity, but there’s no technical requirement that you name those the same.  The binding between parent package variable and child package variable is established in the configuration, not by virtue of naming.
  • If you create a parent package variable configuration for a variable that does not exist in the parent package, the package execution will not fail for that reason; it will simply leave intact the default value (if any) for that child package variable.  Other failures might occur related to that missing parent package variable (for example, a missing connection string if you are using configurations to pass connection strings), but a missing parent package variable alone won’t cause a package failure.
  • The transfer of values from parent to child is a one-way transfer.  The information is passed by value, meaning that a copy of the value (not a reference to the original value) is given from parent to child.  Therefore, if you modify the value of the child package variable that was originally loaded from a parent package configuration, it has no impact on the original value in the parent package.  I will demonstrate in a later post in this series how to pass values by reference from the child package back to the parent package.
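To make those three points concrete, here is a small Python model of the behavior described above.  It is only a sketch of the semantics, not SSIS code: the class, the function, and the child variable names (vChildInt, vChildString, vOptional, vNotInParent) are made up for illustration, and only vIntVar and vStringVar come from the example in this post.

```python
import copy
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ParentVariableConfig:
    """One configuration entry, defined on the child package."""
    parent_variable: str  # the name typed into the wizard
    child_variable: str   # the child variable whose Value is the target


def apply_parent_configs(parent_vars: Dict[str, Any],
                         child_vars: Dict[str, Any],
                         configs: List[ParentVariableConfig]) -> None:
    """Copy values from parent to child according to the child's configurations."""
    for config in configs:
        if config.parent_variable not in parent_vars:
            # A missing parent variable is not an error; the child default stays.
            continue
        # Copy the value so later changes in the child never reach the parent.
        child_vars[config.child_variable] = copy.deepcopy(
            parent_vars[config.parent_variable])


parent = {"vIntVar": 42, "vStringVar": "from parent"}
child = {"vChildInt": 0, "vChildString": "", "vOptional": "child default"}
configs = [
    ParentVariableConfig("vIntVar", "vChildInt"),        # names need not match
    ParentVariableConfig("vStringVar", "vChildString"),
    ParentVariableConfig("vNotInParent", "vOptional"),   # not present in parent
]

apply_parent_configs(parent, child, configs)
child["vChildInt"] = 99  # modifying the child copy leaves the parent untouched

print(parent)
print(child)
```

The binding lives entirely in the list of configurations held by the child, which is why the same child could be run from any parent that happens to expose variables with the expected names.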

Conclusion

In this post, I’ve briefly demonstrated a simple pattern for implementing a parent-child architecture using the package deployment mode.  In my next post, I’ll go into depth on using this architecture in the project deployment mode in SSIS 2012 and 2014.

SQL Saturday Lisbon

It’s a little over a week until this year’s SQL Saturday festivities kick off in Lisbon, Portugal, and I’m very excited to be a part of it.  Registration is nearly full, so if you’re in the area and are planning on attending, register now!

For this event, I’m delivering a full day workshop entitled “Real World SSIS: A Survival Guide”, during which I’ll share design patterns and practical lessons I’ve learned over my 10-ish years in the BI/ETL space.  This workshop will be held on Thursday, April 10th (the Thursday prior to the main SQL Saturday event), and there are still some seats available.  You can register for this workshop online.  I’ve also recorded a teaser video of what’s to come in this workshop.

In addition to the full-day workshop on Thursday, I’ll also be presenting two one-hour sessions on Saturday.  I’ll be sharing “Handling Errors and Data Anomalies in SSIS” and “15 Quick Tips for SSIS Performance” during the regular SQL Saturday event.

If you plan on attending SQL Saturday in Lisbon, please stop by and say hello!  I’m looking forward to seeing you there.

Parent-Child SSIS Architecture

This is the first in a series of technical posts on using parent-child architectures in SQL Server Integration Services.  The index page for all posts can be found here.

In this post, I will provide an overview of the architecture and describe the benefits of implementing a parent-child design pattern in SSIS structures.

Definition

The simplest definition of SSIS parent-child architecture is that it consists of packages executing other packages.  In SSIS, the package is the base executable; it is the most granular component that can be executed independently [1].  Every version of SSIS includes the ability for one package to execute another package through the use of one of the following:

  • The Execute Package Task
  • T-SQL commands
  • The Execute Process Task (using the dtexec.exe utility)

Though the reference to parent-child architecture implies that there are exactly two layers to this design, that does not have to be the case.  You may have a design where a package executes a package which executes a package, and so forth.  Although there may be a hard limit to how deeply nested a parent-child architecture may go, I have never encountered such a limitation.  I have found it useful on a few occasions to go deeper than two levels in this type of architecture, particularly when designing a formal ETL framework (to be discussed further in a future post in this series).  In cases where greater than two levels exist, finding the right terminology for those layers is important.  You can refer to them by patriarchy (grandparent/parent/child) or by cardinality (level 1, level 2, level n), as long as you remain consistent – especially in your documentation – with those references.

Conceptually, a parent-child architecture is a form of code abstraction.  By encapsulating ETL actions into discrete units of work (packages), we’re creating a network of moving parts that can be developed, tested, and executed independently or as part of a larger collection.

Benefits

As I mentioned in my introductory post, there are several benefits to using parent-child structures in SSIS.

Reusability.  In any ETL environment of significant size or complexity, it’s quite normal to discover common ETL behaviors that are reusable across different implementations.  For a concrete example of this: In my spare time, I’m working on an ETL application that downloads XML files from a Major League Baseball web service.  There are files of various formats, and each file format is processed a different way, but with respect to the download of the files, I always perform the same set of operations: create a log entry for the file; attempt to download the file to the local server; log the result (success or failure) of the download operation; if the download has failed, set the HasErrors variable on the main package.  If I were to load this behavior into a group of tasks in the package for each XML format, I’d have five different copies of the same logic.  However, by building a parameterized child package that performs all of these core functions, I only have to build the file download/logging logic once, and execute the resulting package with the appropriate parameters each time I need to download a file.

Easier development.  Working with large and complex SSIS packages can be a pain.  The larger the SSIS packages, the longer it takes for the BIDS or SSDT environment to do its validation checks when the package is opened or modified.  Further, when multiple ETL developers are working on the same project, it is much easier to break apart the project into discrete units of work when using numerous smaller SSIS packages.

Easier testing and debugging.  When working through the test and debug cycles during and after initial development, it’s almost always easier to test and debug smaller packages.  To test a single task that resides in a large SSIS package would require either running the task by itself manually in the Visual Studio designer, or disabling all of the other tasks and redeploying the package.  When working with packages that each perform one unit of work, one can often simply execute the package to be tested through the normal scheduling/execution mechanism.

Clarity of purpose. An architecture that uses small, single-operation packages lends itself to clarity of purpose by virtue of naming.  When browsing a list of deployed packages, it is much more clear to see package names such as “Load Customers Table”, “Merge Product Table”, and “Remove Duplicates in Vehicle Table” than to find do-everything packages with names like “Load Production DB”, “Update DW”, etc.

Performance. In some cases, breaking out multi-step SSIS packages can bring some performance gains.  One distinct case that comes to mind is using a distributed architecture, where packages within a single execution group are executed on multiple servers.  By distributing packages across different SQL Server machines (either physical or virtual), it may be possible to improve performance in cases where the processing load on a single SSIS server has become a bottleneck.  I want to emphasize that using a parent-child architecture does not automatically improve performance, so this should not be used as a silver bullet to improve a poorly performing group of packages.

The Tools

As I mentioned earlier, there are three tools that can be used to execute a package from within another package.

The execute package task.  This is the easiest and most common means of executing a package from within another.  This task can trigger the execution of a package stored on the file system, deployed to MSDB or the SSIS catalog, or residing in the same project.  If using SSIS 2012 with catalog deployment mode, you can also use the execute package task to pass parameter values from the parent package to the child package.  It is important to note that the execute package task behaves differently in SSIS 2012 than it does in older versions.

T-SQL commands (via the execute SQL task).  For SSIS projects using the project deployment model in SSIS 2012, the built-in stored procedures in the SSIS catalog can be used to execute packages.  This method for executing packages, like the execute package task, allows you to specify runtime parameters via T-SQL code.  One significant advantage of using T-SQL commands to execute packages is that, unlike the execute package task, you can use expressions to set at runtime the name of the package to be executed.  This is useful in cases where you are iterating over a list of packages that may not be known at design time, such as a pattern found in ETL frameworks.
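As a rough illustration of the T-SQL approach, the Python sketch below generates the batch that would start one catalog execution per package name.  The helper name and the folder, project, and package names are hypothetical; the stored procedures catalog.create_execution and catalog.start_execution are the ones provided by the SSIS catalog (SSISDB).  In a real execute SQL task you would more likely pass the names as query parameters than build the text by concatenation.

```python
def build_catalog_execution_sql(folder: str, project: str, package: str) -> str:
    """Build a T-SQL batch that creates and starts one catalog execution."""

    def quote(value: str) -> str:
        # Minimal literal quoting for this sketch: double any single quotes.
        return "N'" + value.replace("'", "''") + "'"

    return "\n".join([
        "DECLARE @execution_id BIGINT;",
        "EXEC SSISDB.catalog.create_execution",
        f"    @folder_name = {quote(folder)},",
        f"    @project_name = {quote(project)},",
        f"    @package_name = {quote(package)},",
        "    @execution_id = @execution_id OUTPUT;",
        "EXEC SSISDB.catalog.start_execution @execution_id;",
    ])


# The package list is only known at runtime, e.g. read from a framework table.
packages = ["Load Customers Table.dtsx", "Merge Product Table.dtsx"]
batches = [build_catalog_execution_sql("ETL", "Warehouse", name)
           for name in packages]

for batch in batches:
    print(batch)
    print()
```

Because the package name is just a value in the generated batch, the same loop works for any list of packages a framework hands it.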

dtexec.exe (via the execute process task).  Using this method allows you to trigger package execution via the command-line application dtexec.exe.  Although this method is typically used to execute packages in a standalone environment – for example, when using third-party scheduling tools to orchestrate package execution – dtexec can also be used within SSIS by way of the execute process task.  As an aside, I rarely use dtexec to execute child packages – in most cases, it’s easier to use either the execute package task or T-SQL commands to execute one package from within another.

I’ll also briefly mention dtexecui.exe.  This is a graphical tool that serves the same purpose as dtexec.exe, except that the former exposes functionality via a graphical user interface rather than forcing the user to use command-line parameters for configuration.  Except for this brief mention, I’ll not cover dtexecui.exe in this discussion of parent-child architecture, as that tool is intended for interactive (manual) execution of packages and is not a suitable tool for executing one package from within another.

Parent-Child architecture in the real world

To illustrate how this can work, let’s model out a realistic example.  Imagine that we have charge over the development of a sizeable healthcare database.  In addition to our production data, we’ve got multiple environments – test, development, and training – to support the development life cycle and education needs.  As is typical for these types of environments, these databases need to be refreshed from the production database from time to time.

The refresh processes for each of these environments will look similar to the others.  In each of them, we will extract any necessary data for that environment, retrieve and restore the backup from production, and import the previously extracted data back into that environment.  Since we are dealing with sensitive healthcare data, the information in the training database needs to be sufficiently anonymized to avoid an inappropriate disclosure of data.  In addition, our test database needs to be loaded with some test cases to facilitate testing for potential vulnerabilities.  Even though there are some differences in the way each environment is refreshed, there are several points of shared – and duplicate – behavior, as shown below (with the duplicates in blue).

ParentChild-Duplicate

Instead of using duplicate static elements, we can eliminate some code redundancy and maintenance overhead by encapsulating those shared behaviors into their own container – specifically, a parameterized package.  In doing so, we can avoid having multiple points of administration when (not if) we need to make adjustments to those common elements of the refresh process.  The updated architecture uses parameters (or package configurations, if using package deployment mode in SSIS 2012 or any older version of SSIS) to pass in the name of the database environment to refresh.

ParentChild-HighLevel

As shown, we’ve moved those shared behaviors into a separate package (RefreshDB), the behavior of which is driven by the parameters passed into it.  The duplicate code is gone.  We now have just one SSIS package, instead of three, that needs to be altered when those common behaviors change.  Further, we can individually test and debug the child package containing those common behaviors, without the additional environment-specific operations.
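The same refactoring can be sketched in ordinary code.  In the Python analogy below, each function stands in for a package: refresh_db plays the role of the shared RefreshDB child, and the three callers play the environment-specific parents.  The function names and log messages are illustrative assumptions, and nothing here touches a real database.

```python
from typing import List


def refresh_db(environment: str, log: List[str]) -> None:
    """Shared child: the steps common to every environment refresh."""
    log.append(f"{environment}: extract data to preserve")
    log.append(f"{environment}: restore production backup")
    log.append(f"{environment}: import preserved data")


def refresh_development(log: List[str]) -> None:
    refresh_db("Development", log)


def refresh_training(log: List[str]) -> None:
    refresh_db("Training", log)
    # Environment-specific step that stays in the parent.
    log.append("Training: anonymize sensitive data")


def refresh_test(log: List[str]) -> None:
    refresh_db("Test", log)
    # Environment-specific step that stays in the parent.
    log.append("Test: load test cases")


log: List[str] = []
refresh_development(log)
refresh_training(log)
refresh_test(log)

for entry in log:
    print(entry)
```

A change to the restore step is now made once, in refresh_db, and every environment picks it up.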

Note that we haven’t reduced the number of packages using this architecture.  The goal isn’t fewer packages.  We’re aiming for a modularized, easy-to-maintain design, which typically results in a larger number of packages that each perform just a few (and sometimes just one) functions.  In fact, in the parent-child architecture shown above, we could even further refine this pattern by breaking out the individual operations in the RefreshDB package into packages of their own, which would be practical for cases in which those tasks might be executed apart from the others.

Exceptions to the rule

Are there cases in which parent-child structures do not add value?  Certainly.  A prime example of such a case is a small, simple package developed for a single execution with no expectation that its logic will be reused.  I call these throwaway packages.  Because of their single-use nature, there is likely little value in going through the effort of building a parent-child architecture around their business logic.

Up Next

In my next post in this series, I’ll work through the mechanics of using a parent-child pattern in SSIS 2005 or SSIS 2008.

1 Technically, there are lower-level items in the SSIS infrastructure that can be executed independently.  For example, from the BIDS or SSDT design surface, one can manually execute a single task or container within a package.  However, when deploying or scheduling the execution of some ETL behavior, the package is the lowest level of granularity that can be addressed.

Edit: Corrected typo on one of the graphics.

SSIS and PowerPivot training in Baton Rouge

I’m happy to announce that I’ll be teaming up with my Linchpin People colleague Bill Pearson for a day of BI and SSIS training next month.  On Wednesday, February 12th, we’ll each be delivering a full-day presentation in Baton Rouge, after which we’ll be joining the Baton Rouge SQL Server User Group for their monthly meeting.

SSIS Training with Tim Mitchell

I’ll be presenting Real World SSIS: A Survival Guide, which is aimed at beginning-to-intermediate SSIS developers.  In this day-long training session, I’ll be sharing and demonstrating many of the ETL lessons that I’ve learned in my 10+ years working in the SQL Server business intelligence ecosystem.

At the same time, Bill Pearson will be delivering Practical Self-Service BI with PowerPivot for Excel, which will provide a crash course for those who are new to PowerPivot.  Following these day-long presentations, Bill will also share more on PowerPivot at the Baton Rouge SQL Server User Group that evening.

Registration for both of these day-long courses is currently open, and early-bird pricing is available for a limited time.  If you’re around the Baton Rouge area and are interested in learning more about SSIS or PowerPivot, we’d love to have you join us next month!

New Blog Series: Parent-Child Architecture in SSIS

I’m kicking off a new series of blog posts discussing the topic of parent-child architectures in SQL Server Integration Services.  The links to the posts in this series are below.

I still remember the first SSIS package I ever deployed to a production environment.  It was the mid-2000s, and I was working on a large data migration project which would take my then-employer, an acute care hospital, from an old UNIX-based system into a modern SQL Server-based OLTP back end.  The entire solution, which pushed around a few hundred million rows of data, was completely contained in a single SSIS package.  And this thing was HUGE.  When I say huge, I mean that the package metadata alone was 5 MB in size.  I had a bunch of duplicate code in there, and when I opened or modified the package, it sometimes took a minute or more to go through the validation for the dozens of different tasks and data flows.  In hindsight, I can admit that it was not a well-designed architecture.

Fast forward about a decade.  Having learned some lessons – the hard way – about ETL architecture, I’ve relied on a completely different way of thinking.  Rather than relying on a few, do-everything SSIS packages, I prefer to break out tasks into smaller units of work.  In using more packages that each do just one thing, I’ve discovered that this architecture is:

  • Easier to understand
  • Simpler to debug
  • Much easier to distribute tasks to multiple developers
  • In some cases, better performing

As part of my role as an independent consultant, I also do quite a bit of training, and in those training sessions the topic of parent-child ETL architecture comes up quite often.  How many packages should I have?  Should we have lots of small SSIS packages, or fewer, larger packages?  This is also a topic on which I find a lot of questions on SQL Server discussion forums.

To share my experience on this topic, I’m starting a new series of posts discussing parent-child architectures in SSIS.  As part of this series, I will cover:

  • Overview of parent-child architecture in SSIS
  • Parent-child architecture in SSIS 2005 and 2008
  • Parent-child architecture in SSIS 2012
  • Passing configuration values from parent to child package
  • Passing values from child packages back to the parent
  • Error handling and logging in parent-child structures
  • Parent-child architectures in an ETL framework

I’m looking forward to writing this series over the next few months.  As always I look forward to your feedback.

Continue Package Execution After Error in SSIS

When it comes to ETL, I’m a pessimist.  Until proven otherwise, I assume that all data is bad, all connections are volatile, and all transformation logic is suspect.  As such, I spend a lot of time addressing how to prepare for and handle errors in the ETL pipeline with SSIS.  Building packages to expect and properly handle errors is a key component in a well-performing, durable, trustworthy ETL infrastructure.

SQL Server Integration Services has a number of built-in tools for handling errors and other anomalies.  Among these:

  • Event handlers, defined at the package, container, or task level
  • Precedence constraints on the control flow to change the execution path in the event of an error
  • Error outputs on sources and destinations in the data flow to redirect errors on a row-by-row basis
  • Fail the task (and its ancestors), which is the default behavior

All of these options provide a way to handle or otherwise report the error.  But what if your scenario requires that any error is simply ignored?  SSIS can do that, too.  In this post, I’ll show you how to use a particular attribute of event handlers to prevent the propagation of errors in SSIS executions.

Scenario

Continue after error. It's not that hard.

Let’s say that we have an SSIS package that processes a series of flat files using an instance of the ForEach Loop Container.  The expected behavior is that we specify the directory of files to be processed, and the loop container will process the specified file(s) in that directory.  If a file exists in that directory that fails to process – perhaps its security settings do not allow it to be read by the account executing the SSIS package, or the file is improperly formatted – the default behavior is that the loop container would fail, which would fail the package and any parent package that executes it.  This is known as propagation (remember that word), which can be visualized as a “bubbling up” of errors.  In that way, propagation of errors is not unlike the way exceptions will bubble up in C# or VB.NET programming.

Now in this same scenario, let’s assume that our business needs dictate that we don’t have to successfully process all of the files for the ETL load to be considered a success, and having a partial load of data has greater business value than failing the entire load if one file fails to load.  In that case, we’ll want to override the default behavior of our package to allow certain elements to fail without affecting the outcome of the package.

As shown below, I’ve set up a package for our scenario in which we truncate the output staging table, and then loop through a specified directory to process each text file in that directory.

image

Notice that I’ve got precedence constraints set up after the data flow task in the loop container.  This will allow me to log either a success or failure of a load operation on a file-by-file basis.  The data flow task, which will be executed once per file in the source directory, will attempt to process each text file in that directory as shown below.

image

In this updated scenario, I don’t want to allow a failed file load to interrupt the load of the remainder of the files, so I’m going to make a couple of modifications.  First of all, I’ll create an error event handler for the data flow task.

image

You’ll notice that there are no tasks in the error event handler for the data flow task (DFT Load File).  Although you can add some custom logic here to execute upon error, you don’t necessarily need to do so.  However, creating the error event handler will expose a particular setting we’ll need to prevent the propagation of any errors.

While the error event handler shown above is still in view, open the list of SSIS variables.  Note that you’ll also have to set the variables window to show system variables, which are hidden by default.  In the list of variables, scroll down until you find a variable named Propagate.  This Boolean value is the setting that specifies whether errors within a given executable will bubble up to the ancestor tasks, containers, and packages.  To prevent these errors from bubbling up, I’ll change the value of the Propagate variable from True to False.

image

The net result is that any error in this data flow task will still be shown as an error in this task, but that error will short circuit and will not be passed along to any executables further up the chain.
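Since I compared propagation to exceptions bubbling up in code, here is the equivalent idea as a short Python sketch.  This is an analogy only, not SSIS: the file names and the load_file function are invented, and catching the exception inside the loop plays the role that setting Propagate to False plays on the data flow task.

```python
from typing import List, Tuple


def load_file(path: str) -> int:
    """Pretend to load one file and return a row count (hypothetical logic)."""
    if path.endswith(".bad"):
        raise ValueError(f"cannot parse {path}")
    return 100


def process_directory(paths: List[str]) -> List[Tuple[str, str, str]]:
    """Load every file, recording success or failure for each one."""
    results = []
    for path in paths:
        try:
            rows = load_file(path)
            results.append((path, "success", f"{rows} rows"))
        except ValueError as exc:
            # The failure is recorded here and goes no further up the chain,
            # so the loop carries on with the next file.
            results.append((path, "failure", str(exc)))
    return results


results = process_directory(["a.txt", "b.bad", "c.txt"])
for result in results:
    print(result)
```

Each file still reports its own outcome, so the failure remains visible in the log even though it no longer stops the loop.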

Final Thought

You’ll notice that in the control flow of this package, I included two different paths from the data flow task – one with the On Success constraint, and the other with the On Failure constraint.  I’ve done this to allow the package to separately log the file-by-file results (success or failure, along with row count).  I’ve shown this because I want to emphasize that, in most situations, you should be logging any error such as this – even if your package is built not to fail when an error is encountered.  In my opinion, almost any operation that is attempted, even if it doesn’t affect the outcome of the package, should be logged – especially in the event of failure.

Conclusion

In this post I’ve demonstrated how the built-in components in SSIS can be used to allow for the continued operation after the failure of one executable in a package.  By modifying the Propagate system variable in the error event handler of a volatile executable, you can prevent the automatic failure of upstream package elements, allowing continued execution after the failure of noncritical operations.

Upcoming SQL Saturday Precons

I’m happy to announce that I’ll be delivering three one-day preconference seminars this summer prior to three different SQL Saturday events.

For each of these events, I’ll be delivering a full day of content entitled “Real World SSIS: A Survival Guide.”  This day of content consists of many lessons that I’ve learned – many of which were learned the hard way – through my decade or so in this business.  I’ve got lots of demos to illustrate the concepts we’ll be covering.

If you are able to make it to any of these SQL Saturday events, I’d be honored if you’d join me for one of these talks.  Registration is open for all three seminars, as well as each of the SQL Saturday events.  I hope to see you there!

Using the SSIS Object Variable as a Result Set Enumerator

In the first post in this series, I covered the basics of object typed variables in SQL Server Integration Services, along with a brief examination of some potential use cases.  In this installment, I’m going to illustrate the most common use of object typed variables in SSIS: using an object variable as an ADO recordset within a loop container to perform iterative logic.

Before we examine the how, let’s talk about the why.  Although this is not a design pattern you’ll have to use every day, there are any number of cases that would lend themselves to building and using an ADO recordset enumerator:

  • You need to create a series of export files – one per client – showing that client’s charges for a given period.
  • You’re dealing with a very large set of data, and/or your processing hardware has limited resources.  You want to explore breaking up the workload into smaller chunks to be processed serially.
  • You are performing a data load operation, and want to design the package in such a way that the loaded data can be immediately used as a validation source in the same package execution.

For cases such as these (among others), using this design pattern can be an excellent way to address your ETL needs.

Design Pattern

At a high level, this design pattern will have three moving parts:

  • A relational query used to populate the object variable (thus transforming its internal type into an ADO recordset)
  • A For Each Loop container to loop through the list stored in this variable
  • Some business logic for each value (or set of values) in each row of the object variable

Note that while the first two moving parts I mentioned will be relatively consistent from one package to another, the business logic component will, by nature, vary greatly from one package to another.  For the purposes of this post, I’m purposefully keeping my business logic piece simple so as to not distract from the larger design pattern.
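As a rough analogy outside of SSIS, the three moving parts can be sketched in a few lines of Python.  This is only an illustration of the shape of the pattern: the games table, its columns, and the sample rows are invented for the sketch, and an in-memory SQLite database stands in for the real source.

```python
import sqlite3

# Hypothetical schema and sample rows, invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (game_id INTEGER, is_postseason INTEGER)")
conn.executemany(
    "INSERT INTO games VALUES (?, ?)",
    [(101, 1), (102, 0), (103, 1)],
)

# Part 1: a relational query populates an in-memory list of rows
# (the role played by the Object typed variable).
game_list = conn.execute(
    "SELECT game_id FROM games WHERE is_postseason = 1 ORDER BY game_id"
).fetchall()

# Part 2: loop over the rows (the role of the For Each Loop container).
processed = []
for row in game_list:
    this_game_id = row[0]
    # Part 3: per-row business logic; here it only records the ID.
    processed.append(this_game_id)

print(processed)  # prints: [101, 103]
```

The number of iterations is decided by the data at runtime, not by anything hard-coded in the loop.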

For my sample data, I’m going to deal with a data domain that is near and dear to my heart: baseball.  In this case I want to get a list of all postseason baseball games, and for each game, create an export file detailing the at-bat statistics for that game.  Because I don’t know at design time how many games will be played in the postseason, I can’t simply hard-code my outputs – I need to design the ETL in such a way that the data will dictate, at runtime, the number of output files and their respective filenames.

Configure and Populate the Object Variable

The first thing I’ll do in my demo package is set up an SSIS variable, giving it the data type of Object.  As shown below, I’m using the SSIS variable named [GameList] as the object typed variable, which will store the ADO recordset list of playoff game IDs that should be processed.  Also included is a variable specifying the directory to which the output files will be written, as well as a variable to store the individual game ID for each iteration of the loop.

variables

Next up, I’m going to add an instance of the Execute SQL Task to the control flow of my package, typing in my query to select the IDs of the playoff games from the database.  In the settings for this task shown below, you’ll also see in the highlighted portion that I’ve changed the behavior of the Result Set to use Full result set (remember the default is None, which would expect no data rows to be returned).  By setting this behavior, I’m configuring the task to expect a result set to be returned.

fullresultset

When I configure the Result Set setting in this way, I also need to indicate where those results should end up – specifically, I have to indicate which object typed variable will store these results.  In the Result Set tab of the same task, I’ll set the variable name to use the [GameList] variable I set up in the previous step.  Also note that the result set name should always be 0 in this case.

resultsetmapping

What I’ve done here is quite simple, and required no code (other than the SQL statement, of course).  What’s happening behind the scenes is a little more complex, however.  At runtime when the Execute SQL Task is executed, the [GameList] variable will be instantiated as a new object of type ADO recordset.  Note that this action will not change the data type shown in SSIS; even though the in-memory object will be configured as an ADO recordset, it will still show up as an object type variable in the designer.  This ADO recordset object will then be loaded with the resulting records, if any, from the query I used in the Execute SQL Task.

Using the SSIS Variable as an Enumerator

My next step will be to consume that list, processing each game ID in turn to extract the data I need.  To handle this, I’ll add a For Each Loop container to the control flow, and connect the previously configured instance of Execute SQL Task to this new container.  When I configure the properties for the loop container, in the Collection tab I’m presented with several different options for the enumerator (the list that controls how many times the logic within the loop will be executed).  Since I’m working from the ADO recordset list created in the previous step, I’m going to select Foreach ADO Enumerator, and use the variable drop down list to select the [GameList] object variable.  I also set the Enumeration Mode to use Rows in the first table, which is the only option I can use when working with an ADO recordset (note that we have more options when working with an ADO.NET recordset, which I plan to cover in a future post).

foreachconfig

With the collection tab set to use my object variable as an enumerator, I’ll next jump over to the Variable Mappings tab.  It is on this tab where I will align fields in the record set with variables in the package.  As shown below, I’m only expecting one column to be returned, and for each iteration of the loop, this value will be stored in the variable named [ThisGameID].  As you can see, I’m using index [0] to indicate the position of this value; if the record set is expected to return more than one column, I could add those in as additional column/variable mappings, using the ordinal position of each column to map to the proper SSIS variable.
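The ordinal mapping idea can be illustrated with a small Python sketch.  It assumes a hypothetical two-column result set; the second column and the ThisHomeTeam name are invented here purely to show how additional index-to-variable mappings would behave.

```python
# Hypothetical two-column result set: (game_id, home_team).
rows = [(101, "TEX"), (103, "SFN")]

# Index-to-variable mappings, similar in spirit to the Variable Mappings tab.
# ThisHomeTeam is an invented name used only for this illustration.
variable_mappings = {0: "ThisGameID", 1: "ThisHomeTeam"}

snapshots = []
for row in rows:
    # On each iteration, every mapped variable receives the value found
    # at its ordinal position in the current row.
    variables = {name: row[index] for index, name in variable_mappings.items()}
    snapshots.append(variables)

print(snapshots[0])  # prints: {'ThisGameID': 101, 'ThisHomeTeam': 'TEX'}
```

Because the mapping is positional, reordering the columns in the query without updating the indexes would silently assign the wrong values.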

variablemapping

With that done, I’ll add an instance of the Data Flow Task to the loop container configured above, which will complete the work on the control flow:

controlflow

Configure the Business Logic in the Data Flow

Now it’s time to dive into the data flow I just created.  Within that data flow, I’ll add a new OLE DB Source component, the purpose of which will be to retrieve the at-bat statistics for each playoff game.  To the output of that source, I will attach an instance of the Flat File Destination, which will be used to send each game’s data to the respective output file.

dataflow

Within the data source, I need to configure the query such that it retrieves data for one and only one game at a time.  Since the current game ID value is stored in the [ThisGameID] SSIS variable, I can simply map that variable as a query parameter, so that each execution of this SELECT query will limit the results to only include statistics for the current game ID.  As shown below, I’m using a parameter placeholder (the question mark in the query) to indicate the use of a parameter:

query

… and when I click the Parameters… button, I can map the SSIS variable containing the game ID to that query parameter:

queryparams
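For readers who think in code, the same question mark placeholder style exists in Python’s sqlite3 module, so the mapping can be sketched as follows.  The at_bats table, its columns, and the sample rows are assumptions made for this sketch and are not the query used in the package.

```python
import sqlite3

# Hypothetical table and rows, invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE at_bats (game_id INTEGER, batter TEXT, result TEXT)")
conn.executemany(
    "INSERT INTO at_bats VALUES (?, ?, ?)",
    [
        (101, "Batter A", "single"),
        (101, "Batter B", "strikeout"),
        (103, "Batter C", "walk"),
    ],
)


def at_bats_for_game(connection, this_game_id):
    """Return the at-bat rows for one game, using a ? placeholder."""
    query = "SELECT batter, result FROM at_bats WHERE game_id = ? ORDER BY batter"
    # The tuple supplies the value for the placeholder, much like mapping
    # an SSIS variable to parameter 0.
    return connection.execute(query, (this_game_id,)).fetchall()


rows_101 = at_bats_for_game(conn, 101)
print(rows_101)  # prints: [('Batter A', 'single'), ('Batter B', 'strikeout')]
```

The query text never changes between iterations; only the value bound to the placeholder does.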

I have already configured an instance of the Flat File Destination (and by extension, set up the Flat File Connection Manager) to allow me to write out the results to a file, but how will I create a separate file per game?  It’s actually quite easy: by using a simple SSIS expression on the ConnectionString property of the Flat File Connection Manager, I can configure the output file name to change on each iteration of the loop by using the game ID value as part of the file name.  As shown below, I’m accessing the Expressions collection within my Flat File Connection manager, overriding the static value of the ConnectionString property with an interpreted value using the amalgamation of two variables – the directory location I specified earlier, along with the current game ID.  Remember that since SSIS variables are evaluated at runtime, the value of the variables can change during execution, thus allowing the use of a single Flat File Connection Manager to write out multiple files during each package execution.

fileconnstr
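The effect of that expression can be sketched in Python as a path that is rebuilt on every iteration.  The game_ prefix, the .csv extension, and the sample data below are hypothetical; the real file naming comes from the expression on the connection manager.

```python
import csv
import tempfile
from pathlib import Path


def output_path(output_directory, this_game_id):
    """Combine the directory and the current game ID into a file path.

    The 'game_' prefix and '.csv' extension are assumptions for this sketch.
    """
    return Path(output_directory) / f"game_{this_game_id}.csv"


def export_games(output_directory, stats_by_game):
    """Write one file per game and return the file names that were written."""
    written = []
    for this_game_id, rows in stats_by_game.items():
        # The path is re-evaluated on each pass, just as the expression is
        # re-evaluated for each iteration of the loop.
        path = output_path(output_directory, this_game_id)
        with path.open("w", newline="") as handle:
            csv.writer(handle).writerows(rows)
        written.append(path.name)
    return written


sample = {101: [("Batter A", "single")], 103: [("Batter C", "walk")]}
with tempfile.TemporaryDirectory() as tmp:
    names = export_games(tmp, sample)

print(names)  # prints: ['game_101.csv', 'game_103.csv']
```

A single writer definition produces many files because only the computed path changes from one iteration to the next.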

Finally, when I execute the configured package, I end up with a few dozen output files – one per playoff game.  As a side note, my Texas Rangers were only represented in one of those playoff games from last year.  We’ll get ‘em this year.  As shown below, each output file is distinguished by the game ID in its file name.

outputfiles

Conclusion

Use of the SSIS object typed variable can be a very powerful tool, but it need not be complex.  As shown in this example, we can easily leverage the object variable for iteration over a result set without writing a single line of programmatic code.

In the next post in this series, I’ll dig further into object typed SSIS variables, and will explore how to use and manipulate other types of objects not natively represented in SSIS.

Webinar: Scripting and SSIS

Tomorrow at 10am (11am EDT), I’ll be joining together with my good friend and SSIS Design Patterns coauthor Andy Leonard for a free one hour webinar to discuss scripting in SQL Server Integration Services:

Join SQL Server MVP Tim Mitchell and Andy Leonard as they discuss and demonstrate scripting in SSIS! In this demo-packed session, two co-authors of the book SSIS Design Patterns share their experience using the Script Task and Script Component to accomplish difficult transformations and improve data integration.

You can register online here.  We look forward to seeing you tomorrow!

Using Object Typed Variables in SSIS

Note: This will be the first post in a short series on using Object typed variables in SQL Server Integration Services.

When defining variables in SSIS, the ETL developer has several data type options to choose from depending on the information to be stored in each variable.  Included in the options are String, Boolean, Char, DateTime, and several flavors and sizes of Int.  However, there’s another variable data type that is very handy but also underutilized: the Object data type.

fig1

Why Object Variables?

Variables with a data type of Object are the most flexible variables in SSIS.  Data types such as Int, String, and DateTime are designed to store just one type of data; however, an Object typed SSIS variable can store almost any type of data, even information that can’t otherwise be represented in Integration Services.  In most cases, SSIS doesn’t even have to be configured to know what type of data you’re storing in an Object typed variable – usually, it can simply pass the value along the wire as a bunch of bits without knowing or caring what’s in there.

When considering the use of Object typed variables in SSIS, I’ll give the same disclaimer that I give for using script tasks/components in SSIS: Just because you can doesn’t mean you should.  If a native data type will work to store any possible value for a particular variable, by all means, don’t complicate your code by adding unnecessary moving parts.  Use Object typed variables only when a native type won’t do – otherwise, stick to the well-worn path.

Common Uses

There are a few cases that come to mind that lend themselves to using Object typed variables in SSIS:

  • Iterating over a result set.  This is probably the most common and well-documented use of Object variables, as well as the easiest to implement as it requires no manual coding.  This pattern will allow you to retrieve a set of data from a database, and then perform some operation for each row in that result set.  I’ll work through the mechanics of how to do this in my next post.
  • Handling binary data.  A common example of this is extracting binary data from or writing binary data to a VARBINARY field in a relational database.  If you need an interim storage mechanism in which this binary data should be stored, an Object variable can be a good solution.  In this case, the ETL pipeline doesn’t necessarily need to be aware of what is stored in the Object typed variable.
  • Creating or consuming binary data in SSIS.  Let’s say you need to either generate or process binary data as part of your ETL.  For example, you might need to retrieve a JPEG or PNG file from the file system and write it into a relational database, or retrieve a binary object from a database and process the various elements of that object.  By storing said data in an Object typed variable, you can directly write to or read from this variable within your code.
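To illustrate the binary data cases in general terms, here is a small Python sketch in which an ordinary variable acts as the interim holder for bytes moving to and from a database.  The images table, the logo.png name, and the payload (the 8-byte PNG signature plus padding) are all invented for the sketch, and SQLite’s BLOB type stands in for VARBINARY.

```python
import sqlite3

# Hypothetical payload: the 8-byte PNG signature followed by padding bytes.
image_bytes = b"\x89PNG\r\n\x1a\n" + b"\x00" * 4

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (name TEXT, content BLOB)")

# Write the binary value into the database without inspecting its contents.
conn.execute("INSERT INTO images VALUES (?, ?)", ("logo.png", image_bytes))

# Read it back into an interim variable; the pipeline does not need to know
# what the bytes represent in order to pass them along.
interim = conn.execute(
    "SELECT content FROM images WHERE name = ?", ("logo.png",)
).fetchone()[0]

# Code that does care about the contents can inspect the bytes directly.
looks_like_png = interim.startswith(b"\x89PNG")
print(len(interim), looks_like_png)  # prints: 12 True
```

The pass-through step needs no knowledge of the format; only the final inspection step does, which is the distinction the list above draws between handling and consuming binary data.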

Risks and challenges

Naturally, with a construct as flexible as Object typed variables, there are a few challenges to be aware of when considering when and how to use objects in your SSIS packages.  Among the risks:

  • Some coding required.  In many cases, especially when you’re processing the information contained in the Object typed variable (as opposed to simply passing the value through from a source to a destination), you’re going to have to write some code to address that object.
  • SSIS expressions not allowed.  Because they are designed to store a variety of information structures, Object typed variables cannot be used in an SSIS expression.  Even if the underlying data stored in the variable is of a type that could be stored in a native SSIS type, attempting to add an Object typed variable to an SSIS expression will throw an error.
  • Debugging challenges. If you overflow an Int32, or happen to truncate a string in SSIS, you’ll get a (mostly) friendly error message telling you what went wrong.  Often, when dealing with Object typed variables, you don’t get that luxury in SSIS.

Conclusion

Object typed variables in SQL Server Integration Services allow a great deal of flexibility when dealing with atypical data structures in the ETL cycle.  Although they are not as commonly used as simple native types, Object typed variables can make otherwise difficult ETL tasks easier.

In my experience, I’ve found that Object typed variables are sometimes avoided in SSIS packages simply because they’re misunderstood or believed to be too difficult to use.  In the next few posts in this series, I’ll illustrate how the Object variable can be leveraged in your SSIS package by demonstrating a few use cases where Object typed variables in SSIS are appropriate.