Fix Inconsistent Line Terminators in SSIS

There is a flat file processing issue I’ve run into a number of times over the years, and it’s come up again several times recently. The issue relates to the line terminators used in data files. Occasionally, changes to the systems generating these data files, or perhaps even manual edits, can change the way the file marks the end of a line. These changes can cause package execution to fail, or even worse, the files can be loaded successfully and cause data quality issues in the target.

In every text file, there are unprintable characters called line terminators that mark the end of each line. On most UNIX and Linux systems and some older operating systems, files are created using the line feed character (LF, or ASCII character 10), while files generated in Windows typically use the carriage return and line feed (CRLF, or ASCII characters 13 and 10, respectively) to mark line endings. The tricky part is that these characters are generally not displayed, so opening up a data file with Notepad or a similar editor will not present any visual differences between line feed-terminated files and those using carriage return plus line feed. However, these differences can have a significant impact on ETL processes, as well as any downstream systems relying on data from those files.
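
As a quick illustration, here is a minimal standalone C# sketch (my own, not part of any SSIS package) that prints the raw byte values of the same two lines terminated both ways; the 13 10 pairs versus the lone 10s are the only difference between the files:

using System;
using System.Text;

class LineEndingDemo
{
    static void Main()
    {
        // The same two lines, terminated two different ways.
        string crlf = "Line one\r\nLine two\r\n";   // Windows-style: CR+LF (13, 10)
        string lf = "Line one\nLine two\n";         // Unix-style: LF (10) alone

        // Printing the raw bytes makes the invisible difference visible.
        Console.WriteLine(string.Join(" ", Encoding.ASCII.GetBytes(crlf)));
        Console.WriteLine(string.Join(" ", Encoding.ASCII.GetBytes(lf)));
    }
}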

In this post, I’ll show some examples of both LF and CRLF terminated files, and how they behave in SSIS when the package is expecting one type but gets the other. I’ll then demonstrate two workarounds to prevent unexpected errors due to changing line endings.

The scenario

My client, Contoso, has a new supplier, and I’m setting up the ETL process to import some flat file data from this new supplier. I’m working from a file spec that includes column-level mappings and indicates that lines will be terminated using CRLF. I build the package, test it against sample data provided by Contoso’s new supplier, and send the successfully tested package to QA for final testing and deployment to production.

Weeks go by, and the package works great. However, one morning I get a call from Contoso, asking for advice in troubleshooting this new process. It appears that the package has failed without loading any data, logging a data truncation issue upon failure. Both Contoso and their new supplier have reviewed the data file causing the failure, and cannot find any reason for the error. I open the file up in Notepad++ and turn on the Show End of Line feature, and the problem becomes readily apparent. The most recently successful file looks like this:

[Screenshot: the successful file in Notepad++, with CR LF shown at the end of each line]

However, the failed file looks like this:

[Screenshot: the failed file in Notepad++, with LF alone shown at the end of each line]

The difference is subtle but important: the second file uses a line feed alone as the terminator, while the previous file uses a carriage return plus line feed. This distinction is not visible when using tools such as Notepad, and in fact, even in Notepad++, these characters aren’t shown by default. Invisible or not, though, the distinction is very important.

Why does it matter?

Although they are easy to forget about, incorrect line endings can wreck an ETL process. As shown in the hypothetical scenario above, in cases where the SSIS package expects to receive CRLF line endings but instead gets just an LF, most likely the package will fail due to either data truncation or data type issues. Even worse, if the package is set up to process LF line endings but receives a file with CRLF terminators, chances are good that the data will actually be loaded – with some extra baggage. In the latter case, if the last data field on the line is interpreted as a character data type (CHAR, NVARCHAR, etc.), the carriage return character would be preserved in the loaded data. In the example below, I’ll show how this can impact the quality of that data.

For this example, I’ve created an SSIS package to process a data file using LF line terminators. Then, I regenerate the same data file using CRLF line endings, and process the modified file. The package successfully loads the file, with no apparent problems. Below I can see in my target table that the data has been loaded.

[Screenshot: SSMS results grid showing the rows loaded into the target table]

Now, I want to find all products matching the first ItemName in the list. When I query this table using the exact ItemName value I just saw in my SSMS results window, I find that I get no results at all.

[Screenshot: query filtering on the exact ItemName value, returning zero rows]

Even though I’m typing in the exact description I see, I get no results for that value. The reason is that I’m looking for the literal string ‘Mountain-100 Black, 42’, when in reality, the value in this field contains an unseen carriage return. Because the SSIS connection was configured to use LF as the line ending, it interprets the carriage return to be part of the data, and loads it to the output table. Copying that value from the SSMS results grid and pasting it into the query window confirms that the extra CR character is present at the end of the data value. Knowing this, I can modify the selection criteria I used, changing the query from an exact match to a LIKE with a wildcard at the end to return the values I expected to see.
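
The revised query might look something like the following sketch (the table name dbo.SalesOutput is hypothetical, standing in for whatever target table the package loads):

-- Exact match: returns nothing, because the stored value ends with a hidden CHAR(13)
SELECT ItemName FROM dbo.SalesOutput
WHERE ItemName = 'Mountain-100 Black, 42';

-- Trailing wildcard: matches the value despite the hidden carriage return
SELECT ItemName FROM dbo.SalesOutput
WHERE ItemName LIKE 'Mountain-100 Black, 42%';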

[Screenshot: query using LIKE with a trailing wildcard, returning the expected rows]

This confirms that the line ending is the problem, but what can be done to avoid this in the first place?

Fixing the Line Ending Problem

When coding to avoid issues with inconsistent line endings, there are three potential scenarios to plan for:

  • Lines with LF line terminators
  • Lines with CRLF line terminators
  • Lines with CR line terminators (a far less common scenario)

Planning for the first two scenarios listed above is relatively easy; the last one takes a bit of work. I’ll demonstrate the design patterns for handling each of these.

Coding for LF and CRLF

As I mentioned earlier, a file originally specified with LF endings being switched to CRLF (or vice versa) is more common than you might think. However, this problem is fairly easy to resolve using the SSIS data flow. First, the flat file source should be updated to use only a line feed for the line terminator, as shown below.

[Screenshot: flat file connection manager configured to use LF as the line terminator]

Next, on the data flow, add a derived column transformation to the data pipeline. This transformation will remove any carriage return values (indicated by “\r” in the SSIS expression) found in the last data field.

[Screenshot: derived column transformation removing the carriage return from the last data field]
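
The expression in that derived column is a simple REPLACE. Assuming the last column in the file is named ItemName (the column name here is just an example), it might look like this:

REPLACE(ItemName,"\r","")

Configured to replace the existing column rather than add a new one, this strips the stray carriage return when the file arrives with CRLF terminators, and changes nothing when the file arrives with LF terminators.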

When using this pattern, the output will be the same regardless of whether the lines in the data file are terminated with LF or CRLF. For the latter, the package will simply remove the extra carriage return in the data flow. This is a very easy pattern to implement, and will provide protection against line endings changing from LF to CRLF, or vice versa.

Coding for CR, LF, or CRLF

Building a package to handle any type of line ending – CR, LF, or CRLF – takes a bit more work. Since the SSIS flat file connection manager must be configured for the type of line ending to expect, preparing for line endings that are not known at design time requires a more versatile source: the script component. Using the System.IO namespace in the script component, I can open the file, read through each line, and parse out the values irrespective of the line endings used.

In this example, I’ve added a new script component to the data flow, and I have configured this as a source. Next, I added output columns to the default output on the script component, which match the metadata in the table to which we will be loading the data. Finally, I wrote the code below, which reads each line in the file and assigns the values to their respective columns in the output.

public override void CreateNewOutputRows()
{
    // Requires "using System.IO;" among the script's namespace references.
    // Create a connection to the file; the using block guarantees the reader
    // is closed even if a line fails to parse.
    using (StreamReader reader = new StreamReader(Connections.SalesFile.ConnectionString))
    {
        // Skip header row
        reader.ReadLine();

        while (!reader.EndOfStream)
        {
            string line = reader.ReadLine();
            string[] columns = line.Split(Variables.vDelimiter.ToCharArray());

            // Use an increasing integer for indexing the columns below (so I can be lazy and paste below).
            int i = 0;

            // Add a new row to the output for each line in the file
            Output0Buffer.AddRow();

            // Assign the appropriate values into the output columns
            Output0Buffer.SalesOrderID = int.Parse(columns[i++]);
            Output0Buffer.SalesOrderDetailID = int.Parse(columns[i++]);
            Output0Buffer.CarrierTrackingNumber = columns[i++];
            Output0Buffer.OrderQty = sbyte.Parse(columns[i++]);
            Output0Buffer.ProductID = short.Parse(columns[i++]);
            Output0Buffer.SpecialOfferID = sbyte.Parse(columns[i++]);
            Output0Buffer.UnitPrice = float.Parse(columns[i++]);
            Output0Buffer.UnitPriceDiscount = float.Parse(columns[i++]);
            Output0Buffer.LineTotal = float.Parse(columns[i++]);
            Output0Buffer.rowguid = columns[i++];
            Output0Buffer.ModifiedDate = DateTime.Parse(columns[i++]);
            Output0Buffer.ItemName = columns[i++];
        }
    }
}

The reason this works in the C# code above and not in the flat file source is that C# treats line endings within files differently than the SSIS connection manager does. The flat file connection manager in SSIS has to be configured for one specific type of line ending, while the StreamReader.ReadLine() method treats a carriage return, a line feed, or a carriage return followed by a line feed as the end of a line, irrespective of which one a given file uses.
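
To see this behavior outside of SSIS, here is a minimal console sketch (my own illustration, not from the package above); all three terminator styles yield the same four lines:

using System;
using System.IO;

class ReadLineDemo
{
    static void Main()
    {
        // One string containing all three line ending styles.
        string mixed = "first\r\nsecond\nthird\rfourth";

        using (StringReader reader = new StringReader(mixed))
        {
            // StringReader.ReadLine() follows the same rules as StreamReader.ReadLine():
            // CR, LF, and CRLF are all treated as end-of-line markers.
            string line;
            while ((line = reader.ReadLine()) != null)
                Console.WriteLine(line);   // prints four lines
        }
    }
}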

Again, it’s been rare that I’ve had to code for the possibility of three different line ending types. However, I have seen this occur a few times, and for volatile data files, it’s a design pattern worth considering.

Conclusion

In the same way data professionals are usually skeptical of untested data, they must also be mindful of suspect metadata. Changes that appear to be as benign as a modification to the line terminator characters can have a serious negative impact on ETL and its target systems. Using the defensive ETL strategies I’ve shown here, you can help prevent errors or data quality issues related to changing line endings in data files.

Iterating Through Excel Worksheets with Biml

Wrangling large or complex Excel workbooks in SSIS can be a challenge. From managing data types (more about that in this post by Koen Verbeeck) to addressing multiple worksheets in a single document, configuring SSIS to properly read from or write to Excel documents is tedious at best. While there are no silver bullets to completely solve these problems, I’ve found that – like many other challenges in the SSIS world – using Biml can help eliminate some of the manual work involved.

In this post, I’ll demonstrate how you can use a few lines of BimlScript code to read through multiple worksheets in an Excel source.

The Challenge

For this scenario, I’m using a single Excel 2007 document containing numerous worksheets. All of the worksheets have a similar structure, each containing results from a different geographic area.

[Screenshot: Excel workbook containing numerous worksheets with similar structures]

What I want to accomplish is to programmatically infer from the Excel document the name and structure of each worksheet, and create an SSIS data flow for each one. Doing so can save a great deal of effort, especially if there are many columns, many worksheets, or if the structure of each worksheet can differ from the others.

The Biml Solution

The solution, using BimlScript, is designed as follows:

  • Connect to the source Excel file and loop through the worksheets
  • Create an SSIS data flow for each worksheet
  • In each data flow, create an Excel source component to connect to the worksheet specific to that data flow
  • In each data flow, create an OleDB destination component, and automatically map source columns from the Excel worksheet to the destination table

First, in the Biml file, connect to the source Excel document as shown below. Note that, if you haven’t already, you will need to install the Microsoft ACE OLE DB provider (the Excel 2007 driver) for this code to work.

<#@ import namespace="System.Data" #>
<#@ import namespace="System.Data.OleDb" #>
<#
var connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=E:\\AccidentData.xlsx;Extended Properties=\"EXCEL 12.0 XML;HDR=YES;IMEX=1\";";
var connection = new OleDbConnection(connectionString);
connection.Open();
var worksheetCollection = connection.GetOleDbSchemaTable(OleDbSchemaGuid.Tables, null).Rows.OfType<DataRow>().Select(i => i["TABLE_NAME"].ToString()).Where(i => i.EndsWith("$"));
#>

This will create an enumerable collection named worksheetCollection, which contains the names of each worksheet found in the specified Excel file. (Big thanks to Scott Currie of Varigence for sharing the syntax of this operation.)

Next up, I will iterate through this list of worksheets, building an SSIS data flow for each one. I’ll point out that I could also create one package per worksheet, as opposed to a single package with multiple data flows; however, for the sake of simplicity in demonstration, I’m using just one package with a separate data flow per Excel worksheet. As shown below, I’m creating a foreach loop in BimlScript to loop through each worksheet name. In each iteration of the loop, I’ve created a data flow with the same name as the worksheet.

<# foreach (var worksheet in worksheetCollection) { #>
<Dataflow Name="<#= worksheet.Replace("$", "") #>" DelayValidation="true">
</Dataflow>
<# } #>

So far this data flow task does nothing. Let’s add the Excel source connection.

<# foreach (var worksheet in worksheetCollection) { #>
<Dataflow Name="<#= worksheet.Replace("$", "") #>" DelayValidation="true">
  <Transformations>
    <ExcelSource ConnectionName="Accidents Source File" Name="XLS Accidents – <#= worksheet.Replace("$", "") #>">
      <DirectInput>SELECT * FROM `<#= worksheet #>`</DirectInput>
    </ExcelSource>
  </Transformations>
</Dataflow>
<# } #>

The code above creates a connection to the Excel document, building the SELECT statement using the name of the worksheet as the table name. A couple of things to note here: First, this snippet uses an Excel connection named Accidents Source File, which I set up in a previous step (the code for which is provided in the download). Also, the source name uses the name of the worksheet with the dollar sign ($) removed through the Replace() function in the BimlScript.

Finally, I’ll add the destination connection. Since all of the worksheets in this sample workbook will be loaded to the same table, there is no dynamic code in the destination component. You may notice that this snippet also does not use any column-level mappings. This is by design; since each of the worksheets may have some differences in metadata (an extra column, or missing columns), I’ve excluded any static column mappings, instead relying on automatic column name mapping in Biml. Also, like the source transformation, this destination refers to a connection name that was set up previously.

<# foreach (var worksheet in worksheetCollection) { #>
<Dataflow Name="<#= worksheet.Replace("$", "") #>" DelayValidation="true">
  <Transformations>
    <ExcelSource ConnectionName="Accidents Source File" Name="XLS Accidents – <#= worksheet.Replace("$", "") #>">
      <DirectInput>SELECT * FROM `<#= worksheet #>`</DirectInput>
    </ExcelSource>
    <OleDbDestination Name="OLEDST Accidents Table" ConnectionName="AccidentData DB">
      <ExternalTableOutput Table="[dbo].[AccidentData]"></ExternalTableOutput>
    </OleDbDestination>
  </Transformations>
</Dataflow>
<# } #>

After saving the Biml file and generating packages, the resulting SSIS package appears as below.

[Screenshot: generated SSIS package containing one data flow per worksheet]

Each data flow contains an Excel source connected to the common output table. Each of these Excel sources retrieves data from one worksheet, filtered by the SELECT query.

[Screenshot: Excel source retrieving data from a single worksheet via the SELECT query]

The complete Biml file used for this example can be found here.

Conclusion

Working with Excel data in SSIS can be tedious and time-consuming. However, Biml can help by automating some of the manual operations involved in Excel-to-SSIS mappings.

SSIS Parent-Child Architecture in Catalog Deployment Mode

This is the third in a series of posts about SSIS parent-child architecture.  You can find the index page here.

In my previous posts on SSIS parent-child package architecture, I described the benefits of the parent-child package design pattern and demonstrated the implementation of such a pattern in package deployment mode.  In this post, I will demonstrate the parent-child pattern for implementations using catalog deployment mode, which is the default design on SSIS 2012 and 2014.

Catalog deployment mode in SSIS

If you’re reading this post and find yourself asking, “What is catalog deployment mode?”, here it is in a nutshell: Starting with SQL Server 2012, there were significant changes to the architecture of SSIS, most notably the move to a deployment/storage structure called catalog deployment model (which is also frequently referred to as the project deployment model).  In this model, SSIS code is more project-centric than package-centric; packages are deployed as an entire project instead of individually (though each package can still be executed independently).  Catalog deployment mode in SSIS also brought the addition of parameters, which can be used to externally set runtime values for package executions, as well as project connections, which can be used to easily share connection settings across all packages in a project.  Many other changes were introduced, including a simpler logging model and a dedicated SSIS database.

Among the many changes brought about by the catalog deployment model, the one that had the most impact on the parent-child design pattern is the addition of parameters.  In older versions of SSIS, it was possible to pass runtime values to a package, but the process was clunky at best.  When using SSIS 2012 or 2014 in catalog deployment mode, setting runtime values for a child package (or a standalone package, for that matter) is much easier and more straightforward than performing the same task in previous versions.

It is also worth noting that you don’t have to use the catalog deployment model in recent versions of SSIS.  Although catalog deployment model is the default setting in SSIS 2012 and 2014, you can set your project to use the package deployment model.  You’ll lose many of the new features (including parameterization and simplified logging) by choosing package deployment model, but this might be practical if your organization has made a significant investment in SSIS architecture that would be broken by moving to catalog deployment model.

Parent-child package execution

At the heart of parent-child architecture is the collection of parameters.  In catalog deployment mode, we can set up parameters at the package level or at the project level.  For values that would affect just a single package, using a package parameter would make the most sense.  However, if a value might need to be shared among several (or all) packages in a particular project, a project parameter would allow you to create the parameter once for the entire project rather than one for each package.

Execute package task

When executing a child package, the simplest method is still the execute package task.  Beginning with SSIS 2012, the execute package task has a dropdown list (shown below, on the Package tab) to allow the SSIS developer to specify the target package.

[Screenshot: execute package task editor, Package tab, with the target package dropdown list]

There are a few limitations with this approach.  Most notably, this dropdown list selection only works when calling a package that exists in the same project.  You’ll notice that the selection above the package name, entitled ReferenceType, is set to Project Reference.  Though you can change the ReferenceType to External Reference to run a package stored on the file system or in MSDB, oddly enough you can’t use it to execute a package in a different project deployed to the SSIS catalog (you can read more about that limitation, as well as upvote the issue on Connect here).  I’ll discuss a couple of workarounds for this momentarily.

Clicking over to the Parameter bindings tab, we can specify which values to pass into the child package.  For each child package parameter, we specify exactly one value to be supplied at runtime.  Remember, like the dropdown list for package selection, this piece only works when executing packages in the same project (using the Project Reference setting on the ReferenceType from the Package tab).

[Screenshot: execute package task editor, Parameter bindings tab]

Keep in mind that you have to use a parameter or variable (again, choosing from the dropdown list) to map to the child parameter.  You can’t simply type in a static value in the Binding parameter or value field.  Also, remember that you will only see package parameters (not project parameters) in the list of child package parameters that may be mapped.  This is by design – it wouldn’t make sense to map a value to a project parameter when executing a package in the same project, since that child package would already implicitly have access to all of the project parameters.

Another distinct advantage of using the execute package task is the process for handling errors in the child package.  In the event that a child package fails, the execute package task will fail as well.  This is a good thing, because if the child package does fail, in almost all cases we would want the parent package to fail to prevent dependent tasks from improperly executing.  Even better, error messages from the child package would be bubbled up to the parent package, allowing you to collect error messages from all child packages within the parent package.  Consolidated error handling and logging means less development time upfront, and less maintenance effort down the road.

If you have the option of using the execute package task for starting packages stored in the SSIS catalog, I recommend sticking with this method.

Execute SQL task

Another method for executing one package from another is by using the T-SQL stored procedures in the SSIS catalog itself.  Executing a package in the SSIS catalog in T-SQL is actually a 3-step process:

  • Create the execution entry in the catalog
  • Add in any parameter values
  • Execute the package

Catalog package execution via T-SQL, another new addition in SSIS 2012, allows us to overcome the limitation in the execute package task I mentioned earlier.  Using a T-SQL command (via the execute SQL task in SSIS), we can execute a package in any project.  It’s certainly more difficult to do so, because we lose the convenience of having the list of available packages and parameters exposed in a dropdown list in the GUI.  Here there be typing.  However, being able to execute packages in other projects – and for that matter, on other SSIS catalog servers entirely – somewhat makes up for the manual nature of this method.

To execute a child package using this method, you’d create an execute SQL task and drop in the appropriate commands, which might look something like the following:

DECLARE @execution_id BIGINT

EXEC [SSISDB].[catalog].[create_execution] @package_name = N'ChildPkgRemote.dtsx'
	,@execution_id = @execution_id OUTPUT
	,@folder_name = N'SSIS Parent-Child'
	,@project_name = N'SSIS Parent-Child Catalog Deployment - Child'
	,@use32bitruntime = False
	,@reference_id = NULL

-- Set user parameter value for filename
DECLARE @filename SQL_VARIANT = N'E:\Dropbox\Presentations\_sampleData\USA_small1.txt'

EXEC [SSISDB].[catalog].[set_execution_parameter_value] @execution_id
	,@object_type = 30
	,@parameter_name = N'pSourceFileName'
	,@parameter_value = @filename

-- Set execution parameter for logging level
DECLARE @loggingLevel SMALLINT = 1

EXEC [SSISDB].[catalog].[set_execution_parameter_value] @execution_id
	,@object_type = 50
	,@parameter_name = N'LOGGING_LEVEL'
	,@parameter_value = @loggingLevel

-- Set execution parameter for synchronized
DECLARE @synchronous SMALLINT = 1

EXEC [SSISDB].[catalog].[set_execution_parameter_value] @execution_id
	,@object_type = 50
	,@parameter_name = N'SYNCHRONIZED'
	,@parameter_value = @synchronous

-- Now execute the package
EXEC [SSISDB].[catalog].[start_execution] @execution_id

-- Show status
SELECT [status] AS [execution_status]
FROM [SSISDB].[catalog].[executions]
WHERE execution_id = @execution_id

Two things in particular I want to point out here.  First of all, by default when executing a package using T-SQL, the package is started asynchronously.  This means that when you call the stored procedure [SSISDB].[catalog].[start_execution], the T-SQL command will return immediately (assuming you passed in a valid package name and parameters), giving no indication of either success or failure.  That’s why, in this example, I’m setting the execution parameter named SYNCHRONIZED to force the T-SQL command to wait until the package has completed execution before returning.  (Note: For additional information about execution parameters, check out this post by Phil Brammer).  Second, regardless of whether you set the SYNCHRONIZED parameter, the T-SQL command will not return an error even if the package fails.  I’ve added the last query in this example, which will return the execution status of the package just executed.  I can use this to check the execution status of the child package before starting any subsequent dependent tasks.

[Screenshot: execute SQL task editor with the SQLStatement and ResultSet values set]

As shown, I’ve set the SQLStatement value to the T-SQL code block I listed above, and set the ResultSet value to Single row, the latter of which will allow me to capture the output status of the executed package.  Below, I’ve set that execution status value to a new package variable.

[Screenshot: execute SQL task Result Set page mapping the execution status to a package variable]

To round out this design pattern, I set up my control flow as shown below.  Using precedence constraints coupled with SSIS expressions, I execute the package and then check the return value: a successful catalog execution returns a value of 7, and my parent package handles any return value other than a 7 as a failure.
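
For reference, the expression on the success-path precedence constraint might look like the following, assuming the status was captured into a variable named vExecutionStatus (a variable name of my choosing, not a built-in):

@[User::vExecutionStatus] == 7

The failure path simply uses the inverse expression, @[User::vExecutionStatus] != 7.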

[Screenshot: control flow using precedence constraints to test for a return value of 7]

You may also have to give special consideration to errors in child packages when using T-SQL for package execution – especially when running packages interactively in the BIDS/SSDT designer.  Since the T-SQL command does not report the failure of a package by default, it also doesn’t “bubble up” errors in the traditional SSIS manner.  Therefore, you’ll need to rely on capturing any child package error messages from the SSIS catalog logging tables, especially when developing and testing packages in Visual Studio.
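
As a starting point, a query along these lines (a sketch of my own, not from the original process) retrieves the error messages logged for a given execution; in catalog.event_messages, a message_type of 120 indicates an error:

-- Retrieve error messages logged for a specific execution
SELECT em.message_time
	,em.message_source_name
	,em.message
FROM [SSISDB].[catalog].[event_messages] AS em
WHERE em.operation_id = @execution_id -- execution ID captured earlier
	AND em.message_type = 120 -- 120 = error
ORDER BY em.message_time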

Script Task

It is also possible to execute SSIS packages programmatically from the script task.  This method is significantly more complicated, but also offers a great deal of flexibility.  A fellow SQL Server MVP and acquaintance of mine, Joost van Rossum, has a detailed blog post on how to programmatically execute SSIS packages from the SSIS script task.  I won’t restate what he has already covered in his comprehensive and well-written post on the topic, but if you need to programmatically fire child packages in the SSIS catalog, check out his write-up.

Conclusion

In this post, I’ve covered the essentials of executing child packages in the SSIS catalog, including provisioning for errors in the child package.  Though it can be quirky, especially when executing child packages in a different project or on a different server, there are several different ways to address this parent-child design pattern.

In my next post, I’ll talk a little more about passing values in a parent-child package, and will illustrate how to pass values from child packages back to the parent package.

Resolving the SSDT-BI Installation Error IsVisualStudio2012ProInstalled()

Recently I was building a new virtual machine for presentations, and loaded up my usual battery – Windows 7, SQL Server 2012, and SSDT-BI.  The installation of these pieces went as expected, and everything appeared to work well – until I attempted to execute an SSIS package on this new setup.  When I did so, I received the following cryptic error message:

[Screenshot: error message referencing IsVisualStudio2012ProInstalled()]

Thinking that perhaps I had gotten a bad download for SSDT-BI, I started fresh again, downloading another copy of the bits and working through the machine build process, only to encounter the same error message again.

Fortunately, after just a bit of digging, I found that this is an issue that others have found when going through the same installation process.  I found this thread that points out that the issue occurs because one of the assemblies used by the SSIS designer has failed to properly register.  The solution is to re-register the assembly in question.  The exact steps to do so may vary depending on your OS version, the version of SQL Server, etc.  For my setup (Win7 x64, SQL Server 2012 Developer Edition w/ SP2, and SSDT-BI for Visual Studio 2012), the steps I used to correct the issue are below:

  • Open a command prompt window as administrator (right click the shortcut, and click Run As Administrator)
  • Change directories to C:\Program Files (x86)\Microsoft SDKs\Windows\v8.0a\bin\NETFX 4.0 Tools\
  • Execute the following command: gacutil.exe /if "C:\Program Files (x86)\Microsoft Visual Studio 11.0\Common7\IDE\PrivateAssemblies\Microsoft.SqlServer.Dts.Design.dll"

If the registration was successful, you’ll receive a confirmation message similar to below.

[Screenshot: gacutil confirmation message that the assembly was registered]

Thanks to TechNet user hazel_m_stu_o for the response on the thread mentioned above.

SSIS Parent-Child Architecture in Package Deployment Mode

This is the second in a series of posts about SSIS parent-child architecture.  You can find the index page here.

In my first post in this series, I covered the essentials of SSIS parent-child design patterns, including discussing a number of advantages for using such patterns.  In this post, I’m going to demonstrate how to build parent-child structures using package deployment mode.  This deployment mode is available (though not the default behavior) in SSIS 2012 and 2014, and is the only deployment mode available in SSIS 2008 and earlier.  You might wonder why I might cover this deployment mode, since it’s no longer the preferred method for SSIS 2012 and 2014, and older versions that require package deployment mode are at least six years and two versions old.  I share this information for two main reasons.  First of all, even though package deployment mode is less commonly used than project deployment mode, there are many organizations still using it – even in recent versions of SSIS – due to legacy packages that were imported from older versions.  Additionally, it seems everywhere I go, folks are still running mission-critical packages on older versions of SSIS (especially 2008), and when I demonstrate features in SSIS 2012 or 2014, I’m frequently asked about the equivalent behavior in those older versions.

Under the package deployment model, there are generally three ways to execute one package from another package:

  • Execute Package Task – This is the built-in tool in SSIS that allows the developer to execute a package from another package
  • dtexec.exe – This is the command-line tool for executing packages
  • dtexecui.exe – This is the user-friendly version of dtexec, with a full UI for executing packages

Each of these three methods can be used to execute packages stored on either the file system or in MSDB.  Because this post addresses parent-child structures, which typically involve packages executing other packages, I’ll focus on the execute package task.

Execute Package Task

Using the execute package task in package deployment mode is a relatively straightforward process; simply drag the execute package task onto the control flow surface and set the configuration options.  As shown in the screenshot below, executing a package stored in MSDB requires only a connection to the SQL Server instance and the full path of the package to be executed.  Optionally, you can also specify the password (if the child package to be executed is password protected) and choose to execute the package in a separate process.

[Screenshot: execute package task configured for a child package stored in MSDB]

Similar options are available for a package deployed to the file system.  Note that you must create a file connection to connect to the package stored in the file system as shown below.

[Screenshot: execute package task using a file connection to a package on the file system]

Under the Expressions tab (shown below), you can optionally configure dynamic values using SSIS expression language.  For example, you could add some logic that would dynamically specify the name of the package, which can be useful when building a work pile pattern (which I will cover in a future post).

[Screenshot: execute package task Expressions tab]

Passing Values

So far, all straightforward, with very little to configure, right?  For a simple parent-child setup, this may be all you need to do.  But what if you want to make this arrangement more flexible – for example, you need to pass some values from the parent package to the child package?  A quick review of the execute package task reveals exactly zero hints on how to accomplish this.  By design, the parent package does not explicitly pass values down to child packages; rather, the child package declares – through package configurations – which variables it will consume from the parent.  So, for sharing variables from parent to child, the task list will be:

  • In the parent package, create and populate the variables to be shared with the child package
  • In the child package, create variables that will be populated by values from the parent
  • In the child package, create package configurations – one per variable – to populate the child variables from those in the parent

Creating the variables is very simple.  Start with the parent package, creating a couple of variables – named vIntVar and vStringVar – and setting their values, as shown below.

[Screenshot: parent package variables vIntVar and vStringVar with values assigned]

Next, in the child package (shown below), create two corresponding variables which will be used to store the values from the parent package.  Note that in this example I have not assigned a value to the String variable, since the parent package will be supplying that value at runtime.  For this example, you’d still have to supply a default value for the Integer (Int32) variable, because this data type requires a non-NULL default value.

[Screenshot: child package variables that will receive the parent package values]

That’s the easy part.  The more obscure and difficult part is adding in the configurations for these variables.  When working in package deployment mode in SSIS 2014 or 2012 (or in any package on SSIS 2008 or 2005), package configurations are used to allow child packages to consume variable values from the parent package.  The package configurations, defined on the child packages, specify which values are to be received from the parent package, and which variables those values should be written into.  To create the package configuration in the child package, right click on an empty space in the control flow and select Package Configurations.  You’ll need to click the checkbox to Enable package configurations, if you haven’t previously used configurations in this package.  The empty Package Configurations Organizer window is shown below.

[Screenshot: empty Package Configurations Organizer window]

To create a new package configuration, click the Add… button to open the Package Configuration Wizard window.  On the Configuration type setting, choose Parent package variable, and type in the name of the first variable to configure (vIntVar) as shown below.

[Screenshot: Package Configuration Wizard with Parent package variable selected and vIntVar entered]

Wait – I have to type in that variable name?  Why not have a drop-down list to let me choose it?  That’s an excellent question, and one that I too asked when I first started with package configurations.  The reason you have to type in the variable name is that the child package has no explicit relationship to the parent package.  Theoretically, this child package could be executed from any parent package – it’s not bound to any particular parent.

Once you’ve typed in the name of the parent package variable, click Next > and go to the next configuration page.  Here, you specify where to put this value from the parent package variable.  As shown below, you will use the package tree view to specify the variable to which you’re going to write this value.  Drill into the variable name, expand the Properties beneath that variable, and click on Value to select it as the target of this configuration assignment.

Just as a side note, if you drill into the folders below the list of variables, you can see that you don’t necessarily have to use an SSIS variable in the child package to capture the value from the parent package; you can write directly to other properties of the package (such as connection strings).  For a variety of reasons – most notably, easier troubleshooting – I recommend that you always write a value from a parent package configuration into an SSIS variable in your child package.  You can always assign that child package variable value to a built-in value – for example, a connection string – using an expression elsewhere in the package.
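
For example, if the parent-supplied value lands in a child variable named vConnString (a name I’m making up for illustration), the connection manager’s ConnectionString property expression would simply reference that variable:

@[User::vConnString]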

[Screenshot: selecting Variables > vIntVar > Properties > Value as the configuration target]

After selecting Variables –> vIntVar –> Properties –> Value, click Next > and supply a name for this configuration as shown below.  To keep things simple for this example, I’ve simply named the configuration with the same name as the variable (though you can give it a different name, if you prefer).

[Screenshot: naming the new configuration vIntVar]

For the other variable (vStringVar), you can repeat this process, mapping the variable of that name in the parent to the variable of the same name in the child package.

With both configurations in place, the package is ready to be run for testing.  On the child package, I’ve created a script task that will open a Windows message box to show the variable values to confirm that they are coming from the parent package.  As shown below, the message box window from the script task in the child package confirms that the child package variable values are supplied by the parent package.

[Screenshot: message box from the child package displaying the values supplied by the parent]

As I discussed in my previous post, you now have a simple package infrastructure that lends itself to easier development and troubleshooting, streamlined error handling, and less code repetition.  It’s difficult to see with just two packages, but in an ETL system with dozens or hundreds of packages, the advantages of these parent-child patterns quickly become obvious.

Parting Thoughts

A few things to keep in mind when using the execute package task in package deployment mode:

  • The parent package variable and child package variables do not have to have the same names.  I kept the same names in this example for clarity, but there’s no technical requirement that you name those the same.  The binding between parent package variable and child package variable is established in the configuration, not by virtue of naming.
  • If you create a parent package variable configuration for a variable that does not exist in the parent package, the package execution will not fail for that reason; it will simply leave intact the default value (if any) for that child package variable.  Other failures might occur related to that missing parent package variable (for example, a missing connection string if you are using configurations to pass connection strings), but a missing parent package variable alone won’t cause a package failure.
  • The transfer of values from parent to child is a one-way transfer.  The information is passed by value, meaning that a copy of the value (not a reference to the original value) is given from parent to child.  Therefore, if you modify the value of the child package variable that was originally loaded from a parent package configuration, it has no impact on the original value in the parent package.  I will demonstrate in a later post in this series how to pass values by reference from the child package back to the parent package.

Conclusion

In this post, I’ve briefly demonstrated a simple pattern for implementing a parent-child architecture using the package deployment mode.  In my next post, I’ll go into depth on using this architecture in the project deployment mode in SSIS 2012 and 2014.

SQL Saturday Lisbon

It’s a little over a week until this year’s SQL Saturday festivities kick off in Lisbon, Portugal, and I’m very excited to be a part of it.  Registration is nearly full, so if you’re in the area and are planning on attending, register now!

For this event, I’m delivering a full day workshop entitled “Real World SSIS: A Survival Guide”, during which I’ll share design patterns and practical lessons I’ve learned over my 10-ish years in the BI/ETL space.  This workshop will be held on Thursday, April 10th (the Thursday prior to the main SQL Saturday event), and there are still some seats available.  You can register for this workshop online.  I’ve also recorded a teaser video of what’s to come in this workshop.

In addition to the full-day workshop on Thursday, I’ll also be presenting two one-hour sessions on Saturday.  I’ll be sharing “Handling Errors and Data Anomalies in SSIS” and “15 Quick Tips for SSIS Performance” during the regular SQL Saturday event.

If you plan on attending SQL Saturday in Lisbon, please stop by and say hello!  I’m looking forward to seeing you there.

Parent-Child SSIS Architecture

This is the first in a series of technical posts on using parent-child architectures in SQL Server Integration Services.  The index page for all posts can be found here.

In this post, I will provide an overview of the architecture and describe the benefits of implementing a parent-child design pattern in SSIS structures.

Definition

The simplest definition of SSIS parent-child architecture is that it consists of packages executing other packages.  In SSIS, the package is the base executable; it is the most granular component that can be executed independently¹.  Every version of SSIS includes the ability for one package to execute another package through the use of one of the following:

  • The Execute Package Task
  • T-SQL commands
  • The Execute Process Task (using the dtexec.exe utility)

Though the reference to parent-child architecture implies that there are exactly two layers to this design, that does not have to be the case.  You may have a design where a package executes a package which executes a package, and so forth.  Although there may be a hard limit to how deeply nested a parent-child architecture may go, I have never encountered such a limitation.  I have found it useful on a few occasions to go deeper than two levels in this type of architecture, particularly when designing a formal ETL framework (to be discussed further in a future post in this series).  In cases where greater than two levels exist, finding the right terminology for those layers is important.  You can refer to them by patriarchy (grandparent/parent/child) or by cardinality (level 1, level 2, level n), as long as you remain consistent – especially in your documentation – with those references.

Conceptually, a parent-child architecture is a form of code abstraction.  By encapsulating ETL actions into discrete units of work (packages), we’re creating a network of moving parts that can be developed, tested, and executed independently or as part of a larger collection.

Benefits

As I mentioned in my introductory post, there are several benefits to using parent-child structures in SSIS.

Reusability.  In any ETL environment of significant size or complexity, it’s quite normal to discover common ETL behaviors that are reusable across different implementations.  For a concrete example of this: In my spare time, I’m working on an ETL application that downloads XML files from a Major League Baseball web service.  There are files of various formats, and each file format is processed a different way, but with respect to the download of the files, I always perform the same set of operations: create a log entry for the file; attempt to download the file to the local server; log the result (success or failure) of the download operation; if the download has failed, set the HasErrors variable on the main package.  If I were to load this behavior into a group of tasks in the package for each XML format, I’d have five different copies of the same logic.  However, by building a parameterized child package that performs all of these core functions, I only have to build the file download/logging logic once, and execute the resulting package with the appropriate parameters each time I need to download a file.

Easier development.  Working with large and complex SSIS packages can be a pain.  The larger the SSIS packages, the longer it takes for the BIDS or SSDT environment to do its validation checks when the package is opened or modified.  Further, when multiple ETL developers are working on the same project, it is much easier to break apart the project into discrete units of work when using numerous smaller SSIS packages.

Easier testing and debugging.  When working through the test and debug cycles during and after initial development, it’s almost always easier to test and debug smaller packages.  To test a single task that resides in a large SSIS package would require either running the task by itself manually in the Visual Studio designer, or disabling all of the other tasks and redeploying the package.  When working with packages that each perform one unit of work, one can often simply execute the package to be tested through the normal scheduling/execution mechanism.

Clarity of purpose. An architecture that uses small, single-operation packages lends itself to clarity of purpose by virtue of naming.  When browsing a list of deployed packages, it is much more clear to see package names such as “Load Customers Table”, “Merge Product Table”, and “Remove Duplicates in Vehicle Table” than to find do-everything packages with names like “Load Production DB”, “Update DW”, etc.

Performance. In some cases, breaking out a multi-step SSIS package can bring some performance gains.  One distinct case that comes to mind is using a distributed architecture, where packages within a single execution group are executed on multiple servers.  By distributing packages across different SQL Server machines (either physical or virtual), it may be possible to improve performance in cases where the processing load on a single SSIS server has become a bottleneck.  I want to emphasize that using a parent-child architecture does not arbitrarily improve performance, so this should not be used as a silver bullet to improve a poorly performing group of packages.

The Tools

As I mentioned earlier, there are three tools that can be used to execute a package from within another package.

The execute package task.  This is the easiest and most common means of executing a package from within another.  This task can trigger the execution of a package stored on the file system, deployed to MSDB or the SSIS catalog, or residing in the same project.  If using SSIS 2012 with catalog deployment mode, you can also use the execute package task to pass parameter values from the parent package to the child package.  It is important to note that the execute package task behaves differently in SSIS 2012 than it does in older versions.

T-SQL commands (via the execute SQL task).  For SSIS projects using project deployment model in SSIS 2012, the built-in stored procedures in the SSIS catalog can be used to execute packages.  This method for executing packages, like the execute package task, allows you to specify runtime parameters via T-SQL code.  One significant advantage of using T-SQL commands to execute packages is that, unlike the execute package task, you can use expressions to set the name of the package to be executed at runtime.  This is useful in cases where you are iterating over a list of packages that is not known until runtime, such as a pattern found in ETL frameworks.

dtexec.exe (via the execute process task).  Using this method allows you to trigger package execution via the command-line application dtexec.exe.  This method is typically used to execute packages in a standalone environment – for example, when using third-party scheduling tools to orchestrate package execution – but dtexec can also be used within SSIS by way of the execute process task.  As an aside, I rarely use dtexec to execute child packages – in most cases, it’s easier to use either the execute package task or T-SQL commands to execute one package from within another.
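
For completeness, a dtexec invocation of a file-system package might look something like the sketch below (the paths and variable name are hypothetical); the /SET option fills the role that configurations or parameters otherwise would:

dtexec /FILE "C:\SSIS\ChildPackage.dtsx" /SET "\Package.Variables[User::vSourceFile].Properties[Value]";"E:\Data\USA_small1.txt"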

I’ll also briefly mention dtexecui.exe.  This is a graphical tool that serves the same purpose as dtexec.exe, except that the former exposes functionality via a graphical user interface rather than forcing the user to use command-line parameters for configuration.  Except for this brief mention, I’ll not cover dtexecui.exe in this discussion of parent-child architecture, as that tool is intended for interactive (manual) execution of packages and is not a suitable tool for executing one package from within another.

Parent-Child architecture in the real world

To illustrate how this can work, let’s model out a realistic example.  Imagine that we have charge over the development of a sizeable healthcare database.  In addition to our production data, we’ve got multiple environments – test, development, and training – to support the development life cycle and education needs.  As is typical for these types of environments, these databases need to be refreshed from the production database from time to time.

The refresh processes for each of these environments will look similar to the others.  In each of them, we will extract any necessary data for that environment, retrieve and restore the backup from production, and import the previously extracted data back into that environment.  Since we are dealing with sensitive healthcare data, the information in the training database needs to be sufficiently anonymized to avoid an inappropriate disclosure of data.  In addition, our test database needs to be loaded with some test cases to facilitate testing for potential vulnerabilities.  Even though there are some differences in the way each environment is refreshed, there are several points of shared – and duplicate – behavior, as shown below (with the duplicates in blue).

[Diagram: refresh process for each environment, with the duplicated operations shown in blue]

Instead of using duplicate static elements, we can eliminate some code redundancy and maintenance overhead by encapsulating those shared behaviors into their own container – specifically, a parameterized package.  In doing so, we can avoid having multiple points of administration when (not if) we need to make adjustments to those common elements of the refresh process.  The updated architecture uses parameters (or package configurations, if using package deployment mode in SSIS 2012 or any older version of SSIS) to pass in the name of the database environment to refresh.

[Diagram: updated architecture with shared operations moved into a parameterized RefreshDB package]

As shown, we’ve moved those shared behaviors into a separate package (RefreshDB), the behavior of which is driven by the parameters passed into it.  The duplicate code is gone.  We now have just one SSIS package, instead of three, that need to be altered when those common behaviors change.  Further, we can individually test and debug the child package containing those common behaviors, without the additional environment-specific operations.

Note that we haven’t reduced the number of packages using this architecture.  The goal isn’t fewer packages.  We’re aiming for a modularized, easy-to-maintain design, which typically results in a larger number of packages that each perform just a few (and sometimes just one) functions.  In fact, in the parent-child architecture shown above, we could even further refine this pattern by breaking out the individual operations in the RefreshDB package into packages of their own, which would be practical for cases in which those tasks might be executed apart from the others.

Exceptions to the rule

Are there cases in which parent-child structures do not add value?  Certainly.  A prime example of such a case is a small, simple package developed for a single execution with no expectation that its logic will be reused.  I call these throwaway packages.  Because of their single-use nature, there is likely little value in going through the effort to building a parent-child architecture around their business logic.

Up Next

In my next post in this series, I’ll work through the mechanics of using a parent-child pattern in SSIS 2005 or SSIS 2008.

¹ Technically, there are lower-level items in the SSIS infrastructure that can be executed independently.  For example, from the BIDS or SSDT design surface, one can manually execute a single task or container within a package.  However, when deploying or scheduling the execution of some ETL behavior, the package is the lowest level of granularity that can be addressed.

Edit: Corrected typo on one of the graphics.

SSIS and PowerPivot training in Baton Rouge

I’m happy to announce that I’ll be teaming up with my Linchpin People colleague Bill Pearson for a day of BI and SSIS training next month.  On Wednesday, February 12th, we’ll each be delivering a full-day presentation in Baton Rouge, after which we’ll be joining the Baton Rouge SQL Server User Group for their monthly meeting.

SSIS Training with Tim Mitchell

I’ll be presenting Real World SSIS: A Survival Guide, which is aimed at beginning-to-intermediate SSIS developers.  In this day-long training session, I’ll be sharing and demonstrating many of the ETL lessons that I’ve learned in my 10+ years working in the SQL Server business intelligence ecosystem.

At the same time, Bill Pearson will be delivering Practical Self-Service BI with PowerPivot for Excel, which will provide a crash course for those who are new to PowerPivot.  Following these day-long presentations, Bill will also share more on PowerPivot at the Baton Rouge SQL Server User Group that evening.

Registration for both of these day-long courses is currently open, and early-bird pricing is available for a limited time.  If you’re around the Baton Rouge area and are interested in learning more about SSIS or PowerPivot, we’d love to have you join us next month!

New Blog Series: Parent-Child Architecture in SSIS

I’m kicking off a new series of blog posts discussing the topic of parent-child architectures in SQL Server Integration Services.  The links to the posts in this series are below.

I still remember the first SSIS package I ever deployed to a production environment.  It was the mid-2000s, and I was working on a large data migration project which would take my then-employer, an acute care hospital, from an old UNIX-based system into a modern SQL Server-based OLTP back end.  The entire solution, which pushed around a few hundred million rows of data, was completely contained in a single SSIS package.  And this thing was HUGE.  When I say huge, I mean that the package metadata alone was 5 MB in size.  I had a bunch of duplicate code in there, and when I opened or modified the package, it sometimes took a minute or more to go through the validation for the dozens of different tasks and data flows.  In hindsight, I can admit that it was not a well-designed architecture.

Fast forward about a decade.  Having learned some lessons – the hard way – about ETL architecture, I’ve relied on a completely different way of thinking.  Rather than relying on a few, do-everything SSIS packages, I prefer to break out tasks into smaller units of work.  In using more packages that each do just one thing, I’ve discovered that this architecture is:

  • Easier to understand
  • Simpler to debug
  • Much easier to distribute tasks to multiple developers
  • In some cases, better performing

As part of my role as an independent consultant, I also do quite a bit of training, and in those training sessions the topic of parent-child ETL architecture comes up quite often.  How many packages should I have?  Should we have lots of small SSIS packages, or fewer, larger packages?  This is also a topic on which I find a lot of questions on SQL Server discussion forums.

To share my experience on this topic, I’m starting a new series of posts discussing parent-child architectures in SSIS.  As part of this series, I will cover:

  • Overview of parent-child architecture in SSIS
  • Parent-child architecture in SSIS 2005 and 2008
  • Parent-child architecture in SSIS 2012
  • Passing configuration values from parent to child package
  • Passing values from child packages back to the parent
  • Error handling and logging in parent-child structures
  • Parent-child architectures in an ETL framework

I’m looking forward to writing this series over the next few months.  As always I look forward to your feedback.

Continue Package Execution After Error in SSIS

When it comes to ETL, I’m a pessimist.  Until proven otherwise, I assume that all data is bad, all connections are volatile, and all transformation logic is suspect.  As such, I spend a lot of time addressing how to prepare for and handle errors in the ETL pipeline with SSIS.  Building packages to expect and properly handle errors is a key component in a well-performing, durable, trustworthy ETL infrastructure.

SQL Server Integration Services has a number of built-in tools for handling errors and other anomalies.  Among these:

  • Event handlers, defined at the package, container, or task level
  • Precedence constraints on the control flow to change the execution path in the event of an error
  • Error outputs on sources and destinations in the data flow to redirect errors on a row-by-row basis
  • Fail the task (and its ancestors), which is the default behavior

All of these options provide a way to handle or otherwise report the error.  But what if your scenario requires that any error is simply ignored?  SSIS can do that, too.  In this post, I’ll show you how to use a particular attribute of event handlers to prevent the propagation of errors in SSIS executions.

Scenario

[Image caption: Continue after error. It’s not that hard.]

Let’s say that we have an SSIS package that processes a series of flat files using an instance of the ForEach Loop Container.  The expected behavior is that we specify the directory of files to be processed, and the loop container will process the specified file(s) in that directory.  If a file exists in that directory that fails to process – perhaps its security settings do not allow it to be read by the account executing the SSIS package, or the file is improperly formatted – the default behavior is that the loop container would fail, which would fail the package and any parent package that executes it.  This is known as propagation (remember that word), which can be visualized as a “bubbling up” of errors.  In that way, propagation of errors is not unlike the way exceptions will bubble up in C# or VB.NET programming.

Now in this same scenario, let’s assume that our business needs dictate that we don’t have to successfully process all of the files for the ETL load to be considered a success, and having a partial load of data has greater business value than failing the entire load if one file fails to load.  In that case, we’ll want to override the default behavior of our package to allow certain elements to fail without affecting the outcome of the package.

As shown below, I’ve set up a package for our scenario in which we truncate the output staging table, and then loop through a specified directory to process each text file in that directory.

[Screenshot: control flow truncating the staging table, then looping through the files in the directory]

Notice that I’ve got precedence constraints set up after the data flow task in the loop container.  This will allow me to log either a success or failure of a load operation on a file-by-file basis.  The data flow task, which will be executed once per file in the source directory, will attempt to process each text file in that directory as shown below.

[Screenshot: data flow task that loads each text file]

In this updated scenario, I don’t want to allow a failed file load to interrupt the load of the remainder of the files, so I’m going to make a couple of modifications.  First of all, I’ll create an error event handler for the data flow task.

[Screenshot: empty OnError event handler for the data flow task (DFT Load File)]

You’ll notice that there are no tasks in the error event handler for the data flow task (DFT Load File).  Although you can add some custom logic here to execute upon error, you don’t necessarily need to do so.  However, creating the error event handler will expose a particular setting we’ll need to prevent the propagation of any errors.

While the error event handler shown above is still in view, open the list of SSIS variables.  Note that you’ll also have to set the variables window to show system variables, which are hidden by default.  In the list of variables, scroll down until you find a variable named Propagate.  This Boolean value specifies whether errors within a given executable bubble up to the ancestor tasks, containers, and packages.  To prevent these errors from bubbling up, I’ll change the value of the Propagate variable from True to False.

[Screenshot: variables window showing the system variable Propagate set to False]

The net result is that any error in this data flow task will still be shown as an error in this task, but that error will short circuit and will not be passed along to any executables further up the chain.

Final Thought

You’ll notice that in the control flow of this package, I included two different paths from the data flow task – one with the On Success constraint, and the other with the On Failure constraint.  I’ve done this to allow the package to separately log the file-by-file results (success or failure, along with row count).  The reason I’ve shown this is to emphasize that, in most situations, you should be logging any error such as this – even if your package is built not to fail when an error is encountered.  In my opinion, almost any operation that is attempted, even if it doesn’t affect the outcome of the package, should be logged – especially in the event of failure.

Conclusion

In this post I’ve demonstrated how the built-in components in SSIS can be used to allow for the continued operation after the failure of one executable in a package.  By modifying the Propagate system variable in the error event handler of a volatile executable, you can prevent the automatic failure of upstream package elements, allowing continued execution after the failure of noncritical operations.