When processing data files using SQL Server Integration Services, it is not uncommon to find files with different end-of-line markers for each line in the file. In this post, I will demonstrate how to fix inconsistent line terminators in SSIS to avoid ETL errors.
Fix Inconsistent Line Terminators in SSIS
In every text file, there are unprintable characters called line terminators that mark the end of each line. On most UNIX and Linux systems and some older operating systems, files are created using the line feed character (LF, or ASCII character 10), while files generated in Windows typically use the carriage return and line feed (CRLF, or ASCII characters 13 and 10, respectively) to mark line endings. The tricky part is that these are nonprintable characters, so opening up a data file with Notepad or a similar editor will not present any visual differences between line feed-terminated files and those using carriage return plus line feed. However, these differences can significantly impact the execution of ETL processes and any downstream systems relying on data from those files.
To demonstrate the problem, I’ll show some examples of both LF and CRLF terminated files, and how they behave in SSIS when the package is expecting one type but gets the other. I’ll then demonstrate two workarounds to prevent unexpected errors due to changing line endings.
My imaginary client, Contoso, has a new supplier, and I’m setting up the ETL process to import flat file data from this new supplier. I’m working from a file spec that includes column-level mappings and indicates that lines will be terminated using CRLF. I build the package, test it against sample data provided by Contoso’s new supplier, and send the successfully tested package to QA for final testing and deployment to production.
Weeks go by, and the package works great. However, one morning I get a call from Contoso, asking for advice in troubleshooting this new process. It appears that the package has failed without loading any data, logging a data truncation issue upon failure. Both Contoso and their new supplier have reviewed the data file causing the failure, and cannot find any reason for the error. I open the file up in Notepad++ and turn on the Show Line Endings feature, and the problem becomes obvious. The most recently successful file looks like this:
However, the failed file looks like this:
The difference is subtle but important: The second file uses the line feed character as a terminator, while the previous file uses a carriage return and a line feed. This distinction is not visible when using tools such as Notepad, and in fact, even in Notepad++, these characters aren’t shown by default. However, even though these characters are not visible by default, the distinction is very important.
Why does it matter?
Although they are easy to forget about, incorrect line endings can wreck an ETL process. As shown in the hypothetical scenario above, in cases where the SSIS package expects to receive CRLF line endings but instead gets just an LF, most likely the package will fail due to either data truncation or data type issues. Even worse, if the package is set up to process LF line endings but receives a file with CRLF terminators, chances are good that the data will actually be loaded – with some extra baggage. In the latter case, if the last data field on the line is interpreted as a character data type (CHAR, NVARCHAR, etc.), the carriage return character would be preserved in the loaded data. In the example below, I’ll show how this can impact the quality of that data.
For this example, I’ve created an SSIS package to process a data file using LF line terminators. Then, I regenerate the same data file using CRLF line endings, and process the modified file. The package successfully loads the file, with no apparent problems. Below I can see in my target table that the data has been loaded.
Now, I want to find all products matching the first ItemName in the list. When I query this table using the exact ItemName value I just saw in my SSMS results window, I find that I get no results at all.
Even though I’m typing in the exact description I see, I get no results for that value. The reason is that I’m looking for the literal string ‘Mountain-100 Black, 42’, when in reality, the value in this field contains an unseen carriage return. Because the SSIS connection was configured to use LF as the line ending, it interprets the carriage return to be part of the data, and loads it to the output table. Copying that value from the SSMS results grid and pasting it into the query window confirms that the extra CR character is present at the end of the data value. Knowing this, I can modify the section criteria I used, changing the query from an exact match to a LIKE with a wildcard at the end to return the values I expected to see.
This confirms that the line ending is the problem, but what can be done to avoid this in the first place?
Fixing the Line Ending Problem
When coding to avoid issues with inconsistent line endings, there are three potential scenarios to plan for:
- Lines with LF line terminators
- Lines with CRLF line terminators
- Lines with CR line terminators (a far less common scenario)
Planning for the first two scenarios listed above is relatively easy; the last one takes a bit of work. I’ll demonstrate the design patterns for handling each of these.
Coding for LF and CRLF
As I mentioned earlier, files originally specified as LF endings then getting switched to CRLF (or vice versa) is more common that you might think. However, this problem is fairly easy to resolve using the SSIS data flow. First, the flat file source should be updated to use only a line feed for the line terminator, as shown below.
Next, on the data flow, add a derived column transformation to the data pipeline. This transformation will remove any carriage return values (indicated by “\r” in the SSIS expression) found in the last data field.
When using this pattern, the output will be the same regardless of whether the lines in the data file are terminated with LF or CRLF. For the latter, the package will simply remove the extra carriage return in the data flow. This is a very easy pattern to implement, and will provide protection against line endings changing from LF to CRLF, or vice versa.
Coding for CR, LF, or CRLF
Building a package to handle any type of line ending – CR, LF, or CRLF – takes a bit more work. Since the SSIS flat file connection manager must be configured for the type of line ending to expect, preparing for line endings that are not known at design time requires a more versatile source: the script component. Using the System.IO namespace in the script component, I can open the file, read through each line, and parse out the values irrespective of the line endings used.
In this example, I’ve added a new script component to the data flow, and I have configured this as a source. Next, I added output columns to the default output on the script component, which match the metadata in the table to which we will be loading the data. Finally, I wrote the code below which will read each line in the file, assigning the values to their respective columns in the output.
The reason this works in the C# code above and not in the flat file source is that C# treats lines within files a bit differently than the SSIS connection manager does. The flat file connection manager in SSIS has to be configured for a specific type of line ending, while the StreamReader.ReadLine() function simply reads to the end of the line irrespective of the line ending used.
Again, it’s been rare that I’ve had to code for the possibility of three different line ending types. However, I have seen this occur a few times, and for volatile data files, it’s a design pattern worth considering.
In the same way data professionals are usually skeptical of untested data, they must also be mindful of suspect metadata. Changes that appear to be as benign as a modification to the line terminator characters can have a serious negative impact on ETL and its target systems. Using the defensive ETL strategies I’ve shown here, you can help prevent errors or data quality issues related to changing line endings in data files.