One of the more common questions I find in SQL Server SSIS forums is how to split apart a single text file into multiple outputs. Files extracted from foreign sources often arrive in nonstandard formats, and are too often unchangeable at the source and must be parsed during the import process. For unconventional custom transformations such as this, using the SSIS script component with multiple outputs is often the best tool for shredding and importing data.
In this post, I will demonstrate a relatively simple way to address complex input files using the SQL Server Integration Services script component. In particular, I’ll show how to use multiple outputs to process differing record types within a single file.
Script Component With Multiple Outputs
This example uses a file layout I refer to as the “record type” format. A single file may contain many different record types, with the type identified by the first element in each record (line of text) in the file. You may find other formats, including those that are identified by their position in the file, a particular string of text, or number of elements; this example could be easily modified to handle these types of nonstandard formats. I’ve spent a good part of my career working in and around healthcare data, and I’ve found this record type layout is quite common in EDI (electronic data interchange) situations.
As shown in the example below, there are varying numbers of elements within this file source, which would pose a challenge if you simply use the off-the-shelf transformations.
1|202|Scott,Michael|District Manager|Scranton PA
3|241|9582|Bob Vance Refrigeration
3|189|2792|State of Pennsylvania
3|189|4929|Office Systems of America
3|493|2555|Ace Printing Center
In the snippet for this demonstration, record type 1 contains manager details, record type 2 lists the employee information, and record type 3 shows the list of clients. Since the number of columns and their respective data types differ from one record to the next, I can’t just use a conditional split to create multiple paths for data flow; this is a case for the script component.
To get started, I created a data flow task and added a new script component to it. Since I’m using the script component as a source, I must select that configuration when prompted (shown below).
Opening the UI for this script component reveals the Inputs and Outputs window. Because I selected the source configuration, the component was created with one default output and no inputs, as shown below.
This example requires not one but three outputs, one each for Managers, Employees, and Clients. To create the additional outputs, I’ve used the Add Output function. For each output, I have also configured the column names and data types. As shown, I’ve used meaningful names (rather than the default Output 0, Column 1, etc.) for the outputs as well as the columns – this will come in handy shortly.
Next up: it’s time to write some code. I’ve configured the script to parse the input file using a System.IO.StreamReader object. The file name is defined in the SSIS package as the variable named vInputFilename, which has already been populated with the path to the source file.
I used a C# while loop to read each line of the file into a variable and process each line in turn. The test data is pipe-delimited, so I use the C# string function Split() to break apart each line into atomic elements, stored in a zero-based array I’ve named items. I then evaluate the first element of each line to send each row of data to the correct output.
The script logic follows one of three paths depending on the first element of the row, with each path leading to one of the buffers to be wired up to one of our intended destinations. The buffer objects, which follow the <OutputName>Buffer naming convention, are automagically created for each output defined in the editor in the previous step. In each of the paths, I’ve programmatically created a new output row to send to the data stream by calling the Buffer.AddRow() method. Note that you must call the AddRow() method before attempting to assign values to any of the output variables for that row. After adding the new output row for the appropriate path, the script assigns the corresponding value from the data file to each of the output variables.
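Putting those pieces together, the core of the script looks roughly like the sketch below. This is a simplified outline, not the exact code from the package: the ManagersBuffer, EmployeesBuffer, and ClientsBuffer classes are generated by SSIS from the outputs defined earlier, and the column names shown here (as well as the record type 2 layout, which isn’t shown in the snippet above) are assumptions for illustration.

```csharp
public override void CreateNewOutputRows()
{
    // Open the source file; the path comes from the package variable vInputFilename
    using (System.IO.StreamReader reader =
        new System.IO.StreamReader(Variables.vInputFilename))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Break the pipe-delimited line into a zero-based array of elements
            string[] items = line.Split('|');

            // The first element identifies the record type
            switch (items[0])
            {
                case "1": // Manager record
                    ManagersBuffer.AddRow(); // AddRow() must precede any column assignment
                    ManagersBuffer.ManagerID = int.Parse(items[1]);
                    ManagersBuffer.ManagerName = items[2];
                    ManagersBuffer.Title = items[3];
                    ManagersBuffer.Location = items[4];
                    break;
                case "2": // Employee record (columns illustrative)
                    EmployeesBuffer.AddRow();
                    EmployeesBuffer.EmployeeID = int.Parse(items[1]);
                    EmployeesBuffer.EmployeeName = items[2];
                    break;
                case "3": // Client record
                    ClientsBuffer.AddRow();
                    ClientsBuffer.ClientID = int.Parse(items[1]);
                    ClientsBuffer.AccountNumber = items[2];
                    ClientsBuffer.ClientName = items[3];
                    break;
            }
        }
    }
}
```

Because each case adds rows to a different buffer, each record type flows out through its own output, with its own column list and data types.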
Using the Multiple Outputs
After saving the changes to the script component with multiple outputs configured, I built a data destination for each output type. In this example, I’ve used a Flat File Destination to send each of the three data streams to a delimited file. After creating the destinations, I connected an output (green arrow) from the script component to each of the three destinations. When a component has multiple outputs, the SSIS UI will prompt you to select the proper output as shown below. Be sure to match each output from the script component to its corresponding destination.
After all three of the destinations are configured and properly connected to their respective outputs, the package is ready to execute. The output paths are labeled by default with the name of the output; this is where the use of meaningful and descriptive names becomes valuable. The package practically documents itself!
When executing the package, the UI shows the rows of data flowing from the monolithic file into their respective outputs.
A final review of the 3 output files confirms that the rows were directed to the correct destinations.
In this brief demonstration, I showed how the SSIS script component with multiple outputs can be used to extract data from source files that have varying metadata. The record type format is just one example of such files, but this same design pattern could be used for differing or more complex ETL needs.
Author’s note: Large portions of this post were previously published. This reposted material has been updated to reflect changes in SSIS since the last publication.
Very good post Tim. I never faced this use case but I like the way it is handled. Very knowledgeable for people like me. Thanks a lot 🙂
Thanks for the feedback, Prateek!
“Since the number of columns and their respective data types differ from one record to the next, I can’t just use a conditional split to create multiple paths for data flow; ”
Actually, you can — if you set up your source file connection well.
Configure your source file connection with two columns: the first (RecordType) delimited by pipe (|) and the second (RowData) delimited by a new line.
Use the conditional split to separate by RecordType, and then use DerivedColumn transformations in each output flow to split RowData into the relevant columns for that RecordType.
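For example (a sketch only; this assumes SSIS 2012 or later, where the TOKEN() expression function is available, and the column names are illustrative), the Derived Column expressions for the record type 1 flow might look like:

```
ManagerID   : (DT_I4)TOKEN(RowData,"|",1)
ManagerName : TOKEN(RowData,"|",2)
Title       : TOKEN(RowData,"|",3)
Location    : TOKEN(RowData,"|",4)
```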
Hi Steve – you’re right, but this assumes that the data types are the same. With this example, though, it wouldn’t work because the data type for a single column will vary from one record type to the next.
No assumption needed.
The data type for the common column, RecordType, does not change.
The data types for the derived columns would be set in the Derived Column transformations.
I use this Derived Column pattern regularly for these types of files.
Agreed – adding a new derived column transformation would allow you to create a new column to manipulate data types. Scripting just does the splitting and data type handling in one step.
I don’t know C#, and had made a process roughly equivalent to this using T-SQL and Jeff Moden’s DelimitedSplit8K.(I am running into an entirely separate issue with that method: https://www.sqlservercentral.com/Forums/2013412/Query-conversion-error-from-phantom-data )
One of the issues I discovered in my source data was occasional line feeds within a field (Notes where docs had put separate portions of their comments in separate lines within the comment field of the EHR) that required me to build a correction and re-import process. Is that possible to do in C#? i.e. if a row is too short for its destination, append the next row or sometimes the next two rows.
How fast does the C# code run?
Speed is directly related to data volume, but I’ve used this pattern on several occasions and have been happy with the way it performs. As to literal newlines in the data – the short answer is that it can be done, but the code would need to be modified to differentiate between a newline as a control character and a literal newline in the data.
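As a rough sketch of that modification (assuming each record type has a known expected element count, and using hypothetical helpers ExpectedElementCount() and ProcessRow() in place of the routing logic shown in the post), the loop could hold back short rows and append the following line before processing:

```csharp
// Sketch: repair rows broken by literal newlines inside a field.
string line;
string pending = null; // holds a partial row awaiting its continuation

while ((line = reader.ReadLine()) != null)
{
    // If the previous row was too short, treat this line as its continuation
    string candidate = (pending == null) ? line : pending + " " + line;
    string[] items = candidate.Split('|');

    int expected = ExpectedElementCount(items[0]); // hypothetical; e.g. 5 for type "1"
    if (items.Length < expected)
    {
        pending = candidate; // still incomplete; append the next line too
        continue;
    }

    pending = null;
    ProcessRow(items); // route to the appropriate output buffer as before
}
```

One caveat: this assumes a broken field never itself contains a pipe; rows that still don’t reach the expected length by end of file would need to be sent to an error output for manual review.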
Would it work to have each statement run like “if (items[0] == "1" && items.Length == 4)” with an “else” at the end to catch the errors and pass the “line” string for those rows to a separate export to fix manually?
Hi Tim, I have a mixed data set in a text file with no specific delimiter. There are multiple record types in the file, and each record type can be identified by the value in the first position. Each record type has its own values, and the positions do not line up from one record type to the next. What would be the best approach to tackle this scenario?