Some time back I wrote about how to use the script component to parse out ragged data files in SSIS. In this post, I’ll continue the discussion to describe how to handle mixed format data files – specifically, those with several different record types in each file.
In a perfect world, a flat file will contain a single record type. However, my experience in healthcare several years back, as well as subsequent projects as a consultant, has taught me that there are many antipatterns around mixing file formats in a single data file. Combining several different record types into one file saves a small bit of work by avoiding multiple data files, but it adds complexity to every system consuming the resulting hybrid files. In this post I’ll show an example of this, and I’ll demonstrate a design pattern to handle this type of data formatting.
Handling Mixed Format Data Files in SSIS
Consider the case of the following mixed format data file, which contains a combination of patient hospital visit records as well as procedures undertaken in each of those visits.
As shown, there are two different types of data represented here. The longer records, marked with a V in the first column, appear to have hospital visit information. Subsequent records, indicated by a P, show medical procedures related to those visits. Since the shape and purpose of these types of data are so different, it’s a near certainty that the contents of this file should be sent to two different outputs.
For older versions of SSIS – 2005 and 2008 – this would be problematic. The flat file source in those older versions was hard-wired to look for exactly the number of columns defined in the flat file connection manager, even if it meant reading data from the next line in the file to satisfy the number of expected columns. Fortunately, this behavior was fixed in SSIS 2012, in which any missing columns are simply filled in with NULLs. This change in behavior makes it possible to use native components (read: no scripting) to handle a file similar to the one above.
To process this file in SSIS, the flat file connection manager will be configured with enough column metadata to handle the file format with the greatest number of columns. In the example above, the V record (visit) has 11 columns, while the P record (procedure) has only five. Therefore, the flat file connection manager should be configured with 11 columns, with the data type of each column set up to handle the largest or broadest data type in any row type for that column. As shown below, I’ve configured a flat file connection manager with 11 generically named columns, which will be mapped to the appropriate output tables and columns further downstream.
Using the configuration above, all eleven columns will be populated by the visit records; the procedures record type will load only the first five columns, leaving NULL values in the rest. When connecting to this flat file connection manager in the SSIS data flow using a flat file source, both record types will be loaded into the data pipeline. Once that is in place, separating the records is as easy as using a conditional split based on the first column (V for visit, P for procedure).
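To make the padding behavior concrete, here is a minimal sketch of what the SSIS 2012+ flat file source does with short rows: any missing trailing columns are filled with NULLs (None in Python) rather than pulled from the next line. The delimiter, the sample field values, and the column count of 11 mirror the example above but are otherwise assumptions.

```python
# Simulate the SSIS 2012+ flat file source: pad short rows with NULLs.
EXPECTED_COLUMNS = 11

def parse_row(line, delimiter=","):
    """Split a raw line and pad any missing trailing columns with None."""
    fields = line.rstrip("\r\n").split(delimiter)
    fields += [None] * (EXPECTED_COLUMNS - len(fields))
    return fields

# An 11-column visit record and a 5-column procedure record (sample data):
visit = parse_row("V,1001,Smith,John,1970-01-01,2024-03-01,2024-03-04,ICU,Lee,BlueCross,Y")
proc = parse_row("P,1001,99213,2024-03-02,Office visit")

print(len(visit), len(proc))  # both 11: the procedure row is NULL-padded
```

Columns 6 through 11 of the procedure row come back as None, which is exactly the shape the downstream conditional split relies on.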
As shown, Visit records (those matching [Column0] == “V”) are sent to the Visits output, while all others are sent to the default output named Procedures. Each output can then be mapped to its corresponding target table.
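The conditional split itself is simple enough to sketch outside SSIS: rows whose first column is "V" go to the Visits output, and everything else falls through to the default Procedures output. The row shapes here are illustrative assumptions.

```python
# Sketch of the conditional split: route rows on the first column.
def conditional_split(rows):
    """Return (visits, procedures) based on the record-type flag in column 0."""
    visits, procedures = [], []
    for row in rows:
        (visits if row[0] == "V" else procedures).append(row)
    return visits, procedures

rows = [
    ["V", "1001", "Smith"],
    ["P", "1001", "99213"],
    ["P", "1001", "99214"],
]
visits, procedures = conditional_split(rows)
print(len(visits), len(procedures))  # 1 2
```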
Finally, for each target table, the generic column names must be manually mapped to the corresponding output columns in the target table.
It bears mentioning again that this mixed format data file processing pattern works only in SSIS 2012 and 2014. To accomplish this in older versions, the scripting method I described in my earlier post would be used.
How would you turn this around to produce an OUTPUT file with mixed record types? I need to produce a file with a header record followed by 1 to n data rows. The header and data rows therefore have completely different structures.
Hi Graham – That’s a bit more complicated. If you want to use an SSIS data flow for that, you’ll need to concatenate each set of rows into a single field, and then use a union all transformation to bring them all together. Depending on how complicated the metadata is (how many columns, delimiter, etc.), I might avoid the data flow entirely and do it via script task.
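The script-task route the reply suggests can be sketched as follows: since the header and data records share no structure, each record type is formatted to a single string with its own layout and the rows are simply written in order. The pipe-delimited header, comma-delimited data rows, and "H"/"D" record flags are illustrative assumptions, not a real spec.

```python
import io

# Sketch of a script-task-style writer for a mixed-format output file:
# one header record followed by 1 to n data records.
def write_mixed_file(f, header_fields, data_rows):
    """Write a header record, then the data records, each with its own layout."""
    f.write("H|" + "|".join(header_fields) + "\n")  # header layout (assumed)
    for row in data_rows:
        f.write("D," + ",".join(row) + "\n")        # data layout (assumed)

buf = io.StringIO()
write_mixed_file(buf, ["2024-03-01", "claims"],
                 [["1001", "99213"], ["1001", "99214"]])
print(buf.getvalue())
```

In a real SSIS script task the same logic would run in C# against a StreamWriter, but the shape of the solution is identical: format each record type independently, then control the write order yourself.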