SQL Saturday Lisbon

It’s a little over a week until this year’s SQL Saturday festivities kick off in Lisbon, Portugal, and I’m very excited to be a part of it.  Registration is nearly full, so if you’re in the area and are planning on attending, register now!

For this event, I’m delivering a full-day workshop entitled “Real World SSIS: A Survival Guide”, during which I’ll share design patterns and practical lessons I’ve learned over my 10-ish years in the BI/ETL space.  This workshop will be held on Thursday, April 10th (the Thursday prior to the main SQL Saturday event), and there are still some seats available.  You can register for this workshop online.  I’ve also recorded a teaser video of what’s to come in this workshop.

In addition to the full-day workshop on Thursday, I’ll also be presenting two one-hour sessions on Saturday.  I’ll be sharing “Handling Errors and Data Anomalies in SSIS” and “15 Quick Tips for SSIS Performance” during the regular SQL Saturday event.

If you plan on attending SQL Saturday in Lisbon, please stop by and say hello!  I’m looking forward to seeing you there.

Parent-Child SSIS Architecture

This is the first in a series of technical posts on using parent-child architectures in SQL Server Integration Services.  The index page for all posts can be found here.

In this post, I will provide an overview of the architecture and describe the benefits of implementing a parent-child design pattern in SSIS structures.

Definition

The simplest definition of SSIS parent-child architecture is that it consists of packages executing other packages.  In SSIS, the package is the base executable; it is the most granular component that can be executed independently [1].  Every version of SSIS includes the ability for one package to execute another package through the use of one of the following:

  • The Execute Package Task
  • T-SQL commands
  • The Execute Process Task (using the dtexec.exe utility)

Though the reference to parent-child architecture implies that there are exactly two layers to this design, that does not have to be the case.  You may have a design where a package executes a package which executes a package, and so forth.  If there is a hard limit to how deeply a parent-child architecture may be nested, I have never encountered it.  I have found it useful on a few occasions to go deeper than two levels in this type of architecture, particularly when designing a formal ETL framework (to be discussed further in a future post in this series).  When more than two levels exist, finding the right terminology for those layers is important.  You can refer to them by patriarchy (grandparent/parent/child) or by cardinality (level 1, level 2, level n), as long as you remain consistent – especially in your documentation – with those references.

Conceptually, a parent-child architecture is a form of code abstraction.  By encapsulating ETL actions into discrete units of work (packages), we’re creating a network of moving parts that can be developed, tested, and executed independently or as part of a larger collection.

Benefits

As I mentioned in my introductory post, there are several benefits to using parent-child structures in SSIS.

Reusability.  In any ETL environment of significant size or complexity, it’s quite normal to discover common ETL behaviors that are reusable across different implementations.  For a concrete example of this: In my spare time, I’m working on an ETL application that downloads XML files from a Major League Baseball web service.  There are files of various formats, and each file format is processed a different way, but with respect to the download of the files, I always perform the same set of operations: create a log entry for the file; attempt to download the file to the local server; log the result (success or failure) of the download operation; if the download has failed, set the HasErrors variable on the main package.  If I were to load this behavior into a group of tasks in the package for each XML format, I’d have five different copies of the same logic.  However, by building a parameterized child package that performs all of these core functions, I only have to build the file download/logging logic once, and execute the resulting package with the appropriate parameters each time I need to download a file.

Easier development.  Working with large and complex SSIS packages can be a pain.  The larger the SSIS package, the longer it takes for the BIDS or SSDT environment to complete its validation checks when the package is opened or modified.  Further, when multiple ETL developers are working on the same project, it is much easier to break the project apart into discrete units of work when using numerous smaller SSIS packages.

Easier testing and debugging.  When working through the test and debug cycles during and after initial development, it’s almost always easier to test and debug smaller packages.  Testing a single task that resides in a large SSIS package requires either running the task by itself manually in the Visual Studio designer, or disabling all of the other tasks and redeploying the package.  When working with packages that each perform one unit of work, one can often simply execute the package to be tested through the normal scheduling/execution mechanism.

Clarity of purpose. An architecture that uses small, single-operation packages lends itself to clarity of purpose by virtue of naming.  When browsing a list of deployed packages, it is much clearer to see package names such as “Load Customers Table”, “Merge Product Table”, and “Remove Duplicates in Vehicle Table” than to find do-everything packages with names like “Load Production DB”, “Update DW”, etc.

Performance. In some cases, breaking up a multi-step SSIS package can bring performance gains.  One distinct case that comes to mind is using a distributed architecture, where packages within a single execution group are executed on multiple servers.  By distributing packages across different SQL Server machines (either physical or virtual), it may be possible to improve performance in cases where the processing load on a single SSIS server has become a bottleneck.  I want to emphasize that using a parent-child architecture does not automatically improve performance, so this should not be used as a silver bullet to improve a poorly performing group of packages.

The Tools

As I mentioned earlier, there are three tools that can be used to execute a package from within another package.

The execute package task.  This is the easiest and most common means of executing a package from within another.  This task can trigger the execution of a package stored on the file system, deployed to MSDB or the SSIS catalog, or residing in the same project.  If you are using SSIS 2012 with the project deployment model, you can also use the execute package task to pass parameter values from the parent package to the child package.  It is important to note that the execute package task behaves differently in SSIS 2012 than it does in older versions.

T-SQL commands (via the execute SQL task).  For SSIS projects using the project deployment model in SSIS 2012, the built-in stored procedures in the SSIS catalog can be used to execute packages.  This method for executing packages, like the execute package task, allows you to specify runtime parameters via T-SQL code.  One significant advantage of using T-SQL commands to execute packages is that, unlike the execute package task, you can use expressions to set the name of the package to be executed at runtime.  This is useful in cases where you are iterating over a list of packages that may not be known until runtime, such as in a pattern found in ETL frameworks.
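
As a rough sketch of this approach, the T-SQL below uses the SSIS catalog’s built-in stored procedures to start a child package and pass it a parameter value.  The folder, project, package, and parameter names here are invented for illustration.

DECLARE @execution_id BIGINT;

-- Create an execution instance for the child package (names are hypothetical)
EXEC SSISDB.catalog.create_execution
    @folder_name = N'MyFolder',
    @project_name = N'MyProject',
    @package_name = N'ChildPackage.dtsx',
    @use32bitruntime = 0,
    @execution_id = @execution_id OUTPUT;

-- Supply a package parameter value (object_type 30 = package parameter, 20 = project parameter)
EXEC SSISDB.catalog.set_execution_parameter_value
    @execution_id,
    @object_type = 30,
    @parameter_name = N'SourceDirectory',
    @parameter_value = N'C:\ImportFiles';

-- Start the execution
EXEC SSISDB.catalog.start_execution @execution_id;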

dtexec.exe (via the execute process task).  Using this method allows you to trigger package execution via the command-line application dtexec.exe.  Although this method is typically used to execute packages in a standalone environment – for example, when using third-party scheduling tools to orchestrate package execution – dtexec can also be used within SSIS by way of the execute process task.  As an aside, I rarely use dtexec to execute child packages – in most cases, it’s easier to use either the execute package task or T-SQL commands to execute one package from within another.
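
For reference, a couple of minimal dtexec command lines are shown below – one for a package stored on the file system, and one for a package deployed to the SSIS catalog.  The file path, catalog path, and server name are placeholders.

dtexec /File "C:\SSIS\Packages\ChildPackage.dtsx"

dtexec /ISServer "\SSISDB\MyFolder\MyProject\ChildPackage.dtsx" /Server "localhost"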

I’ll also briefly mention dtexecui.exe.  This is a graphical tool that serves the same purpose as dtexec.exe, except that it exposes functionality through a graphical user interface rather than requiring the user to supply command-line parameters for configuration.  Except for this brief mention, I’ll not cover dtexecui.exe in this discussion of parent-child architecture, as that tool is intended for interactive (manual) execution of packages and is not a suitable tool for executing one package from within another.

Parent-Child architecture in the real world

To illustrate how this can work, let’s model out a realistic example.  Imagine that we have charge over the development of a sizeable healthcare database.  In addition to our production data, we’ve got multiple environments – test, development, and training – to support the development life cycle and education needs.  As is typical for these types of environments, these databases need to be refreshed from the production database from time to time.

The refresh processes for each of these environments will look similar to the others.  In each of them, we will extract any necessary data for that environment, retrieve and restore the backup from production, and import the previously extracted data back into that environment.  Since we are dealing with sensitive healthcare data, the information in the training database needs to be sufficiently anonymized to avoid an inappropriate disclosure of data.  In addition, our test database needs to be loaded with some test cases to facilitate testing for potential vulnerabilities.  Even though there are some differences in the way each environment is refreshed, there are several points of shared – and duplicate – behavior, as shown below (with the duplicates in blue).

[Image: environment refresh processes for test, development, and training, with the duplicated steps shown in blue]

Instead of using duplicate static elements, we can eliminate some code redundancy and maintenance overhead by encapsulating those shared behaviors into their own container – specifically, a parameterized package.  In doing so, we can avoid having multiple points of administration when (not if) we need to make adjustments to those common elements of the refresh process.  The updated architecture uses parameters (or package configurations, if using the package deployment model in SSIS 2012 or any older version of SSIS) to pass in the name of the database environment to refresh.

[Image: updated high-level architecture with the shared steps moved into a parameterized RefreshDB package]

As shown, we’ve moved those shared behaviors into a separate package (RefreshDB), the behavior of which is driven by the parameters passed into it.  The duplicate code is gone.  We now have just one SSIS package, instead of three, that needs to be altered when those common behaviors change.  Further, we can individually test and debug the child package containing those common behaviors, without the additional environment-specific operations.

Note that we haven’t reduced the number of packages using this architecture.  The goal isn’t fewer packages.  We’re aiming for a modularized, easy-to-maintain design, which typically results in a larger number of packages that each perform just a few functions (and sometimes just one).  In fact, in the parent-child architecture shown above, we could further refine this pattern by breaking out the individual operations in the RefreshDB package into packages of their own, which would be practical for cases in which those tasks might be executed apart from the others.

Exceptions to the rule

Are there cases in which parent-child structures do not add value?  Certainly.  A prime example of such a case is a small, simple package developed for a single execution with no expectation that its logic will be reused.  I call these throwaway packages.  Because of their single-use nature, there is likely little value in going through the effort of building a parent-child architecture around their business logic.

Up Next

In my next post in this series, I’ll work through the mechanics of using a parent-child pattern in SSIS 2005 or SSIS 2008.

[1] Technically, there are lower-level items in the SSIS infrastructure that can be executed independently.  For example, from the BIDS or SSDT design surface, one can manually execute a single task or container within a package.  However, when deploying or scheduling the execution of some ETL behavior, the package is the lowest level of granularity that can be addressed.

Edit: Corrected typo on one of the graphics.

SSIS and PowerPivot training in Baton Rouge

I’m happy to announce that I’ll be teaming up with my Linchpin People colleague Bill Pearson for a day of BI and SSIS training next month.  On Wednesday, February 12th, we’ll each be delivering a full-day presentation in Baton Rouge, after which we’ll be joining the Baton Rouge SQL Server User Group for their monthly meeting.

SSIS Training with Tim Mitchell

I’ll be presenting Real World SSIS: A Survival Guide, which is aimed at beginning-to-intermediate SSIS developers.  In this day-long training session, I’ll be sharing and demonstrating many of the ETL lessons that I’ve learned in my 10+ years working in the SQL Server business intelligence ecosystem.

At the same time, Bill Pearson will be delivering Practical Self-Service BI with PowerPivot for Excel, which will provide a crash course for those who are new to PowerPivot.  Following these day-long presentations, Bill will also share more on PowerPivot at the Baton Rouge SQL Server User Group that evening.

Registration for both of these day-long courses is currently open, and early-bird pricing is available for a limited time.  If you’re around the Baton Rouge area and are interested in learning more about SSIS or PowerPivot, we’d love to have you join us next month!

New Blog Series: Parent-Child Architecture in SSIS

This month I’m kicking off a new series of blog posts discussing the topic of parent-child architectures in SQL Server Integration Services.

I still remember the first SSIS package I ever deployed to a production environment.  It was the mid-2000s, and I was working on a large data migration project which would take my then-employer, an acute care hospital, from an old UNIX-based system into a modern SQL Server-based OLTP back end.  The entire solution, which pushed around a few hundred million rows of data, was completely contained in a single SSIS package.  And this thing was HUGE.  When I say huge, I mean that the package metadata alone was 5 MB in size.  I had a bunch of duplicate code in there, and when I opened or modified the package, it sometimes took a minute or more to go through the validation for the dozens of different tasks and data flows.  In hindsight, I can admit that it was not a well-designed architecture.

Fast forward about a decade.  Having learned some lessons – the hard way – about ETL architecture, I’ve come to rely on a completely different way of thinking.  Rather than relying on a few do-everything SSIS packages, I prefer to break out tasks into smaller units of work.  By using more packages that each do just one thing, I’ve discovered that this architecture is:

  • Easier to understand
  • Simpler to debug
  • Much easier to distribute tasks to multiple developers
  • In some cases, better performing

As part of my role as an independent consultant, I also do quite a bit of training, and in those training sessions the topic of parent-child ETL architecture comes up quite often.  How many packages should I have?  Should we have lots of small SSIS packages, or fewer, larger packages?  This topic also generates a lot of questions on SQL Server discussion forums.

To share my experience on this topic, I’m starting a new series of posts discussing parent-child architectures in SSIS.  As part of this series, I will cover:

  • Overview of parent-child architecture in SSIS
  • Parent-child architecture in SSIS 2005 and 2008
  • Parent-child architecture in SSIS 2012
  • Passing configuration values from parent to child package
  • Passing values from child packages back to the parent
  • Error handling and logging in parent-child structures
  • Parent-child architectures in an ETL framework

I’m looking forward to writing this series over the next few months.  As always, I look forward to your feedback.

Continue Package Execution After Error in SSIS

When it comes to ETL, I’m a pessimist.  Until proven otherwise, I assume that all data is bad, all connections are volatile, and all transformation logic is suspect.  As such, I spend a lot of time addressing how to prepare for and handle errors in the ETL pipeline with SSIS.  Building packages to expect and properly handle errors is a key component in a well-performing, durable, trustworthy ETL infrastructure.

SQL Server Integration Services has a number of built-in tools for handling errors and other anomalies.  Among these:

  • Event handlers, defined at the package, container, or task level
  • Precedence constraints in the control flow to change the execution path in the event of an error
  • Error outputs on sources and destinations in the data flow to redirect errors on a row-by-row basis
  • Failing the task (and its ancestors), which is the default behavior

All of these options provide a way to handle or otherwise report the error.  But what if your scenario requires that any error is simply ignored?  SSIS can do that, too.  In this post, I’ll show you how to use a particular attribute of event handlers to prevent the propagation of errors in SSIS executions.

Scenario

[Image: “Continue after error. It’s not that hard.”]

Let’s say that we have an SSIS package that processes a series of flat files using an instance of the ForEach Loop Container.  The expected behavior is that we specify the directory of files to be processed, and the loop container will process the specified file(s) in that directory.  If a file exists in that directory that fails to process – perhaps its security settings do not allow it to be read by the account executing the SSIS package, or the file is improperly formatted – the default behavior is that the loop container would fail, which would fail the package and any parent package that executes it.  This is known as propagation (remember that word), which can be visualized as a “bubbling up” of errors.  In that way, propagation of errors is not unlike the way exceptions will bubble up in C# or VB.NET programming.

Now in this same scenario, let’s assume that our business needs dictate that we don’t have to successfully process all of the files for the ETL load to be considered a success, and having a partial load of data has greater business value than failing the entire load if one file fails to load.  In that case, we’ll want to override the default behavior of our package to allow certain elements to fail without affecting the outcome of the package.

As shown below, I’ve set up a package for our scenario in which we truncate the output staging table, and then loop through a specified directory to process each text file in that directory.

[Image: control flow with a staging-table truncation task followed by a ForEach Loop Container]

Notice that I’ve got precedence constraints set up after the data flow task in the loop container.  This will allow me to log either a success or failure of a load operation on a file-by-file basis.  The data flow task, which will be executed once per file in the source directory, will attempt to process each text file in that directory as shown below.

[Image: data flow task that loads each text file into the staging table]

In this updated scenario, I don’t want to allow a failed file load to interrupt the load of the remainder of the files, so I’m going to make a couple of modifications.  First of all, I’ll create an error event handler for the data flow task.

[Image: empty OnError event handler for the data flow task (DFT Load File)]

You’ll notice that there are no tasks in the error event handler for the data flow task (DFT Load File).  Although you can add some custom logic here to execute upon error, you don’t necessarily need to do so.  However, creating the error event handler will expose a particular setting we’ll need to prevent the propagation of any errors.

While the error event handler shown above is still in view, open the list of SSIS variables.  Note that you’ll also have to set the variables window to show system variables, which are hidden by default.  In the list of variables, scroll down until you find a variable named Propagate.  This Boolean value is the setting that specifies whether errors within a given executable bubble up to the ancestor tasks, containers, and packages.  To prevent these errors from bubbling up, I’ll change the value of the Propagate variable from True to False.

[Image: system variable Propagate set to False in the event handler scope]

The net result is that any error in this data flow task will still be shown as an error in this task, but that error will short circuit and will not be passed along to any executables further up the chain.

Final Thought

You’ll notice that in the control flow of this package, I included two different paths from the data flow task – one with the On Success constraint, and the other with the On Failure constraint.  I’ve done this to allow the package to separately log the file-by-file results (success or failure, along with row count).  I’ve shown this because I want to emphasize that, in most situations, you should be logging any error such as this – even if your package is built not to fail when an error is encountered.  In my opinion, almost any operation that is attempted, even if it doesn’t affect the outcome of the package, should be logged – especially in the event of failure.

Conclusion

In this post I’ve demonstrated how the built-in components in SSIS can be used to allow continued operation after the failure of one executable in a package.  By modifying the Propagate system variable in the error event handler of a volatile executable, you can prevent the automatic failure of upstream package elements, allowing continued execution after the failure of noncritical operations.

Upcoming SQL Saturday Precons

I’m happy to announce that I’ll be delivering three one-day preconference seminars this summer prior to three different SQL Saturday events:

For each of these events, I’ll be delivering a full day of content entitled “Real World SSIS: A Survival Guide.”  This day of content consists of many lessons that I’ve learned – many of which were learned the hard way – through my decade or so in this business.  I’ve got lots of demos to illustrate the concepts we’ll be covering.

If you are able to make it to any of these SQL Saturday events, I’d be honored if you’d join me for one of these talks.  Registration is open for all three seminars, as well as each of the SQL Saturday events.  I hope to see you there!

Using the SSIS Object Variable as a Result Set Enumerator

In the first post in this series, I covered the basics of object typed variables in SQL Server Integration Services, along with a brief examination of some potential use cases.  In this installment, I’m going to illustrate the most common use of object typed variables in SSIS: using an object variable as an ADO recordset within a loop container to perform iterative logic.

Before we examine the how, let’s talk about the why.  Although this is not a design pattern you’ll have to use every day, there are any number of cases that would lend themselves to building and using an ADO recordset enumerator:

  • You need to create a series of export files – one per client – showing that client’s charges for a given period.
  • You’re dealing with a very large set of data, and/or your processing hardware has limited resources.  You want to explore breaking up the workload into smaller chunks to be processed serially.
  • You are performing a data load operation, and want to design the package in such a way that the loaded data can be immediately used as a validation source in the same package execution.

For cases such as these (among others), using this design pattern can be an excellent way to address your ETL needs.

Design Pattern

At a high level, this design pattern will have three moving parts:

  • A relational query used to populate the object variable (thus transforming its internal type into an ADO recordset)
  • A For Each Loop container to loop through the list stored in this variable
  • Some business logic for each value (or set of values) in each row of the object variable

Note that while the first two moving parts I mentioned will be relatively consistent from one package to another, the business logic component will, by nature, vary greatly.  For the purposes of this post, I’m purposefully keeping my business logic piece simple so as not to distract from the larger design pattern.

For my sample data, I’m going to deal with a data domain that is near and dear to my heart: baseball.  In this case I want to get a list of all postseason baseball games, and for each game, create an export file detailing the at-bat statistics for that game.  Because I don’t know at design time how many games will be played in the postseason, I can’t simply hard-code my outputs – I need to design the ETL in such a way that the data will dictate, at runtime, the number of output files and their respective filenames.

Configuring and Populating the Object Variable

The first thing I’ll do in my demo package is set up an SSIS variable, giving it the data type of Object.  As shown below, I’m using the SSIS variable named [GameList] as the object typed variable, which will store the ADO recordset list of playoff game IDs that should be processed.  Also included is a variable specifying the directory to which the output files will be written, as well as a variable to store the individual game ID for each iteration of the loop.

[Image: SSIS variables window, including the Object typed GameList variable]

Next up, I’m going to add an instance of the Execute SQL Task to the control flow of my package, typing in my query to select the IDs of the playoff games from the database.  In the settings for this task shown below, you’ll also see in the highlighted portion that I’ve changed the behavior of the Result Set to use Full result set (remember the default is None, which would expect no data rows to be returned).  By setting this behavior, I’m configuring the task to expect a result set to be returned.

[Image: Execute SQL Task settings with the ResultSet property set to Full result set]
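
The exact query isn’t important to the pattern; something along the lines of the statement below – with table and column names invented for this example – would return one row per postseason game.

SELECT GameID
FROM dbo.Games
WHERE IsPostseason = 1
ORDER BY GameDate;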

When I configure the Result Set setting in this way, I also need to indicate where those results should end up – specifically, I have to indicate which object typed variable will store these results.  In the Result Set tab of the same task, I’ll set the variable name to use the [GameList] variable I set up in the previous step.  Also note that the result set name should always be 0 in this case.

[Image: Result Set tab mapping result name 0 to the GameList variable]

What I’ve done here is quite simple, and required no code (other than the SQL statement, of course).  What’s happening behind the scenes is a little more complex, however.  At runtime when the Execute SQL Task is executed, the [GameList] variable will be instantiated as a new object of type ADO recordset.  Note that this action will not change the data type shown in SSIS; even though the in-memory object will be configured as an ADO recordset, it will still show up as an object type variable in the designer.  This ADO recordset object will then be loaded with the resulting records, if any, from the query I used in the Execute SQL Task.
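
As a side note, if you ever needed to inspect the contents of that variable in code – which is not required for this pattern – the ADO recordset stored in the object variable can be loaded into a DataTable from a Script Task.  Treat the snippet below as a rough sketch: the GameList variable would need to appear in the task’s ReadOnlyVariables, and shredding the recordset this way can advance its cursor, so in practice you would do this instead of (not in addition to) the loop described next.

using System.Data;
using System.Data.OleDb;

// Load the ADO recordset stored in the object variable into a DataTable
DataTable games = new DataTable();
OleDbDataAdapter adapter = new OleDbDataAdapter();
adapter.Fill(games, Dts.Variables["User::GameList"].Value);

// Report the row count through the SSIS logging infrastructure
bool fireAgain = true;
Dts.Events.FireInformation(0, "GameList", "Rows in recordset: " + games.Rows.Count.ToString(), string.Empty, 0, ref fireAgain);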

Using the SSIS Variable as an Enumerator

My next step will be to consume that list, processing each game ID in turn to extract the data I need.  To handle this, I’ll add a For Each Loop container to the control flow, and connect the previously configured instance of Execute SQL Task to this new container.  When I configure the properties for the loop container, in the Collection tab I’m presented with several different options for the enumerator (the list that controls how many times the logic within the loop will be executed).  Since I’m working from the ADO recordset list created in the previous step, I’m going to select Foreach ADO Enumerator, and use the variable drop-down list to select the [GameList] object variable.  I also set the Enumeration Mode to use Rows in the first table, which is the only option I can use when working with an ADO recordset (note that we have more options when working with an ADO.NET DataSet, which I plan to cover in a future post).

[Image: ForEach Loop Collection tab using the Foreach ADO Enumerator with the GameList variable]

With the collection tab set to use my object variable as an enumerator, I’ll next jump over to the Variable Mappings tab.  It is on this tab where I will align fields in the record set with variables in the package.  As shown below, I’m only expecting one column to be returned, and for each iteration of the loop, this value will be stored in the variable named [ThisGameID].  As you can see, I’m using index [0] to indicate the position of this value; if the record set is expected to return more than one column, I could add those in as additional column/variable mappings, using the ordinal position of each column to map to the proper SSIS variable.

[Image: Variable Mappings tab mapping index 0 to the ThisGameID variable]

With that done, I’ll add an instance of the Data Flow Task to the loop container configured above, which will complete the work on the control flow:

[Image: completed control flow with the Execute SQL Task feeding the ForEach Loop Container]

Configuring the Business Logic in the Data Flow

Now it’s time to dive into the data flow I just created.  Within that data flow, I’ll add a new OLE DB Source component, the purpose of which will be to retrieve the at-bat statistics for each playoff game. To the output of that source, I will attach an instance of the Flat File Destination, which will be used to send each game’s data to the respective output file.

[Image: data flow with an OLE DB Source feeding a Flat File Destination]

Within the data source, I need to configure the query such that it retrieves data for one and only one game at a time.  Since the current game ID value is stored in the [ThisGameID] SSIS variable, I can simply map that variable as a query parameter, so that each execution of this SELECT query will limit the results to only include statistics for the current game ID.  As shown below, I’m using a parameter placeholder (the question mark in the query) to indicate the use of a parameter:

[Image: source query using a parameter placeholder]
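
A query of roughly this shape – with hypothetical table and column names – illustrates the placeholder; the question mark will be bound to the current game ID at runtime.

SELECT BatterName, Inning, AtBatResult
FROM dbo.AtBats
WHERE GameID = ?
ORDER BY Inning;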

… and when I click the Parameters… button, I can map the SSIS variable containing the game ID to that query parameter:

[Image: query parameter mapping to the ThisGameID variable]

I have already configured an instance of the Flat File Destination (and by extension, set up the Flat File Connection Manager) to allow me to write out the results to a file, but how will I create a separate file per game?  It’s actually quite easy: by using a simple SSIS expression on the ConnectionString property of the Flat File Connection Manager, I can configure the output file name to change on each iteration of the loop by using the game ID value as part of the file name.  As shown below, I’m accessing the Expressions collection within my Flat File Connection manager, overriding the static value of the ConnectionString property with an interpreted value using the concatenation of two variables – the directory location I specified earlier, along with the current game ID.  Remember that since SSIS variables are evaluated at runtime, the value of the variables can change during execution, thus allowing the use of a single Flat File Connection Manager to write out multiple files during each package execution.

[Image: ConnectionString expression on the Flat File Connection Manager]
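
The expression itself is a simple concatenation.  Assuming the directory variable were named [ExportDirectory] (its actual name will vary), the ConnectionString expression might look something like the line below; if the game ID variable is not already a string, it would also need a cast such as (DT_WSTR, 12).

@[User::ExportDirectory] + "Game_" + @[User::ThisGameID] + ".txt"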

Finally, when I execute the configured package, I end up with a few dozen output files – one per playoff game.  As a side note, my Texas Rangers were only represented in one of those playoff games from last year.  We’ll get ‘em this year.  As shown below, each output file is uniquely named, with the game ID as part of the file name.

[Image: output files named using the game ID]

Conclusion

Use of the SSIS object typed variable can be a very powerful tool, but it need not be complex.  As shown in this example, we can easily leverage the object variable for iteration over a result set without writing a single line of programmatic code.

In the next post in this series, I’ll dig further into object typed SSIS variables, and will explore how to use and manipulate other types of objects not natively represented in SSIS.

Webinar: Scripting and SSIS

Tomorrow at 10am (11am EDT), I’ll be teaming up with my good friend and SSIS Design Patterns coauthor Andy Leonard for a free one-hour webinar to discuss scripting in SQL Server Integration Services:

Join SQL Server MVP Tim Mitchell and Andy Leonard as they discuss and demonstrate scripting in SSIS! In this demo-packed session, two co-authors of the book SSIS Design Patterns share their experience using the Script Task and Script Component to accomplish difficult transformations and improve data integration.

You can register online here.  We look forward to seeing you tomorrow!

Using Object Typed Variables in SSIS

Note: This will be the first post in a short series on using Object typed variables in SQL Server Integration Services.

When defining variables in SSIS, the ETL developer has several data type options to choose from depending on the information to be stored in each variable.  Included in the options are String, Boolean, Char, DateTime, and several flavors and sizes of Int.  However, there’s another variable data type that is very handy but also underutilized: the Object data type.

[Image: SSIS variable data type list showing the Object type]

Why Object Variables?

Variables with a data type of Object are the most flexible variables in SSIS.  Data types such as Int, String, and DateTime are designed to store just one type of data; however, an Object typed SSIS variable can store almost any type of data, even information that can’t otherwise be represented in Integration Services.  In most cases, SSIS doesn’t even have to be configured to know what type of data you’re storing in an Object typed variable – usually, it can simply pass the value along the wire as a bunch of bits without knowing or caring what’s in there.

When considering the use of Object typed variables in SSIS, I’ll give the same disclaimer that I give for using script tasks/components in SSIS: Just because you can doesn’t mean you should.  If a native data type will work to store any possible value for a particular variable, by all means, don’t complicate your code by adding unnecessary moving parts.  Use Object typed variables only when a native type won’t do – otherwise, stick to the well-worn path.

Common Uses

There are a few cases that come to mind that lend themselves to using Object typed variables in SSIS:

  • Iterating over a result set.  This is probably the most common and well-documented use of the Object variable, as well as the easiest to implement, as it requires no manual coding.  This pattern will allow you to retrieve a set of data from a database, and then perform some operation for each row in that result set.  I’ll work through the mechanics of how to do this in my next post.
  • Handling binary data.  A common example of this is extracting binary data from or writing binary data to a VARBINARY field in a relational database.  If you need an interim storage mechanism in which this binary data can be held, an Object variable can be a good solution.  In this case, the ETL pipeline doesn’t necessarily need to be aware of what is stored in the Object typed variable.
  • Creating or consuming binary data in SSIS.  Let’s say you need to either generate or process binary data as part of your ETL.  For example, you might need to retrieve a JPEG or PNG file from the file system and write it into a relational database, or retrieve a binary object from a database and process the various elements of that object.  By storing said data in an Object typed variable, you can directly write to or read from this variable within your code; a minimal sketch follows this list.
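
The Script Task snippet below is a rough sketch of that third scenario.  The variable names (User::ImagePath and User::ImageBytes) are invented for the example; User::ImageBytes would need to be an Object typed variable listed in the task’s ReadWriteVariables, and User::ImagePath a String variable listed in ReadOnlyVariables.

using System.IO;

// Read a binary file from disk and park its bytes in the Object typed variable
string path = Dts.Variables["User::ImagePath"].Value.ToString();
byte[] fileBytes = File.ReadAllBytes(path);
Dts.Variables["User::ImageBytes"].Value = fileBytes;

Dts.TaskResult = (int)ScriptResults.Success;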

Risks and challenges

Naturally, with a construct as flexible as an Object typed variable, there are a few challenges to be aware of when considering when and how to use objects in your SSIS packages.  Among the risks:

  • Some coding required.  In many cases, especially when you’re processing the information contained in the Object typed variable (as opposed to simply passing the value through from a source to a destination), you’re going to have to write some code to address that object.
  • SSIS expressions not allowed.  Because they are designed to store a variety of information structures, Object typed variables cannot be used in an SSIS expression.  Even if the underlying data stored in the variable is of a type that could be stored in a native SSIS type, attempting to add an Object typed variable to an SSIS expression will throw an error.
  • Debugging challenges. If you overflow an Int32, or happen to truncate a string in SSIS, you’ll get a (mostly) friendly error message telling you what went wrong.  Often, when dealing with Object typed variables, you don’t get that luxury in SSIS.

Conclusion

Object typed variables in SQL Server Integration Services allow a great deal of flexibility when dealing with atypical data structures in the ETL cycle.  Although they are not as commonly used as simple native types, Object typed variables can make otherwise difficult ETL tasks easier.

In my experience, I’ve found that Object typed variables are sometimes avoided in SSIS packages simply because they’re misunderstood or believed to be too difficult to use.  In the next few posts in this series, I’ll illustrate how the Object variable can be leveraged in your SSIS package by demonstrating a few use cases where Object typed variables in SSIS are appropriate.

Ragged Flat File Processing in SSIS

Frequently I am asked how to properly use SQL Server Integration Services to handle ragged flat files, an example of which is shown below.  In this small sample file, you can see that we are dealing with medical data that has an inconsistent format.  As shown, the rows in the pipe-delimited file can have as few as seven and as many as eleven columns.

[Image: sample ragged pipe-delimited flat file]

Although you and I can look at this file and interpret its meaning, previous versions of SSIS cannot (see note below regarding SSIS in SQL Server 2012).  You can see that we’re not only missing the data but the delimiters as well, which is the root of the problem.  This type of format, where delimiters are simply left off when values at the end of the row are missing or NULL, is unfortunately a relatively common occurrence that ETL developers must address.

With a file like this, SSIS will continue to consume data for each field until it finds the specified field delimiter – even if that means that the data is “gobbled up” from the next line in the file!  Let’s assume that the outputs on my Flat File Source component define 11 fields (the largest number of fields found in this file snippet).  When SSIS processes the first row of data, it’s going to get to the end of that row still expecting to find a pipe delimiter to indicate the end of that field.  Rather than assuming the end of the line of data (represented by an unprintable newline character) is actually the end of the column, the Flat File Source component will continue to consume data until either it finds the specified delimiter or the length of the data consumed exceeds the width specified for that column.  In this example, the eighth field would not be the literal “453” at the end of the line in the file, but would contain a newline character followed by the first value of the following line of data.  Then the next field, which should be NULL, would contain “Reichert”, followed by the incorrect value of “781” and so forth.  As shown below, this slip would cascade through the remainder of the file.

[Image: misaligned results after the ragged file is processed with standard settings]

In a case like this, the very best you can hope for is that the data length would overflow the allowed field width, or a data type conflict would cause an error and fail the package.  In a worst-case scenario, SSIS could actually consume the entire file with this misalignment as shown above, which may not raise an error in the package but would almost certainly cause downstream issues in the destination systems.

Scripting to the rescue!

Never fear – there is a solution to this problem.  By using just a bit of code in SSIS, we can easily get around this issue and process the ragged flat file.  The methodology we’ll use is as follows:

  • Using the Flat File Source, consume each line of text as one big string (rather than breaking it apart into individual fields)
  • Using the Script Component, parse each line of the file.  Dynamically populate the output fields with the data in each line of the file, and assign NULLs to those fields where no value is present.
  • Send the cleansed output to the destination table

Sounds easy enough, right?  Let’s go through this step by step.

Process each line of data

As mentioned, we’re going to trick our Flat File Source component into processing this file by forcing it to handle each line of data as if it were a single text field.  As shown below, I’ve configured a single output field, named [Column 0], as a field of type String with a width of 2000 characters.  This will include everything on the line of data, including the pipe delimiters, in the output.

[Image: Flat File Source configured with a single string column, Column 0, 2000 characters wide]

Next, we’ll add a script component into the data flow pane.  We’ll configure the input of the script component to use the single field generated by the flat file source above.

[Image: Script Component input using the single column from the Flat File Source]

We also need to add all of the output fields manually, which we will populate from within the script.  As shown below, I’ve already added the eleven fields to the script component output, configuring each one with the appropriate data type and length.

[Image: Script Component output columns configured with the appropriate data types and lengths]

Now onto the fun part: The Code.  Within the script, our methodology will be as follows:

  • Break apart the big text string into its individual data fields, splitting on the specified delimiter (in this case, the pipe symbol).
  • If there are mandatory fields (i.e., any fields which must be present for a valid row of data), assign these to the appropriate output columns.
  • For the optional output columns, process the values retrieved from the input string, in order, assigning each one to the next output column until no other input values exist.

First things first – let’s grab and shred the input text.  As shown below, I’m going to split the input on the pipe delimiter, resulting in an array of strings, each with a single value.

// Break apart the input string into its atomic elements
string[] allValues = Row.Column0.Split('|');

Next, we’ll work through all of the fields that are expected to be present in all valid rows of data.  In situations where there are mandatory fields which should always be present, no dynamic processing is required on those values, so we can simply assign them in order.  Note that I’ve used an incrementing counter variable to work through the array of input data – this is simply a shortcut step to allow me to copy/paste the same code rather than typing the cardinal numbers for each element.

// Counter
int i = 0;

// Address the values known to be present in every row
Row.MRN = allValues[i++];
Row.PatientLastName = allValues[i++];
Row.PhysicianID = allValues[i++];
Row.ProcedureDate = allValues[i++];
Row.ProcedureID = allValues[i++];
Row.ProcedureDescription = allValues[i++];
Row.ProcedureStatus = allValues[i++];

Now we move on to the dynamic portion of the processing.  From this point, all columns are optional, so we need to go through the remaining values, one by one, and assign them to the final four output columns (all related to risk factor codes in this example).  As shown in the code snippet below, I’m continuing to work through the string array created earlier to process all of the values.  As long as values remain in the array, we’ll continue assigning those values to the next risk factor code in order; when we run out of values, all remaining risk factor columns will be populated with a NULL string.

// The next four values are variable.  We may have zero to four of these, so
//  we’ll add them dynamically as they appear
Row.RiskFactor1 = allValues.Length > i ? allValues[i++] : null;
Row.RiskFactor2 = allValues.Length > i ? allValues[i++] : null;
Row.RiskFactor3 = allValues.Length > i ? allValues[i++] : null;
Row.RiskFactor4 = allValues.Length > i ? allValues[i] : null;

We only have four possible additional columns in this scenario, but this pattern can easily be applied to similar scenarios with more than just a few optional elements.

Consume the cleansed data

Finally, we just need to route the cleansed data out to our destination.  As shown below, we’re sending the processed data to a SQL Server table.

[Image: data flow sending the cleansed rows to a SQL Server destination table]

SQL Server 2012 changes

As mentioned earlier, this is a design pattern that I recommend for versions of SSIS prior to 2012.  If you are using SSIS with SQL 2012, this workaround is unnecessary: the Flat File Source was updated in this latest version so that it allows easy processing of these types of ragged files.

Conclusion

Ragged flat file processing can be a pain when working with older versions of SQL Server Integration Services.  This post demonstrates how, with just a few lines of code, you can avoid issues with ragged files by dynamically “peeling off” values and assigning them to the appropriate outputs.

Note: The package and sample data file I used for this demonstration are attached here.