Tim Mitchell
Follow Tim Mitchell on Twitter  Like Tim Mitchell on Facebook  Subscribe on YouTube  Connect on LinkedIn  Subscribe to the Data Geek Newsletter

Fix Inconsistent Line Terminators in SSIS

Fix Inconsistent Line Terminators in SSISWhen processing data files using SQL Server Integration Services, it is not uncommon to find files with different end-of-line markers for each line in the file. In this post, I will demonstrate how to fix inconsistent line terminators in SSIS to avoid ETL errors.

Fix Inconsistent Line Terminators in SSIS

In every text file, there are unprintable characters called line terminators that mark the end of each line. On most UNIX and Linux systems and some older operating systems, files are created using the line feed character (LF, or ASCII character 10), while files generated in Windows typically use the carriage return and line feed (CRLF, or ASCII characters 13 and 10, respectively) to mark line endings. The tricky part is that these are nonprintable characters, so opening up a data file with Notepad or a similar editor will not present any visual differences between line feed-terminated files and those using carriage return plus line feed. However, these differences can significantly impact the execution of ETL processes and any downstream systems relying on data from those files.

To demonstrate the problem, I’ll show some examples of both LF and CRLF terminated files, and how they behave in SSIS when the package is expecting one type but gets the other. I’ll then demonstrate two workarounds to prevent unexpected errors due to changing line endings.

The scenario

My imaginary client, Contoso, has a new supplier, and I’m setting up the ETL process to import flat file data from this new supplier. I’m working from a file spec that includes column-level mappings and indicates that lines will be terminated using CRLF. I build the package, test it against sample data provided by Contoso’s new supplier, and send the successfully tested package to QA for final testing and deployment to production.

Weeks go by, and the package works great. However, one morning I get a call from Contoso, asking for advice in troubleshooting this new process. It appears that the package has failed without loading any data, logging a data truncation issue upon failure. Both Contoso and their new supplier have reviewed the data file causing the failure, and cannot find any reason for the error. I open the file up in Notepad++ and turn on the Show Line Endings feature, and the problem becomes obvious. The most recently successful file looks like this:

image

However, the failed file looks like this:

image

The difference is subtle but important: The second file uses the line feed character as a terminator, while the previous file uses a carriage return and a line feed. This distinction is not visible when using tools such as Notepad, and in fact, even in Notepad++, these characters aren’t shown by default. However, even though these characters are not visible by default, the distinction is very important.

Why does it matter?

Although they are easy to forget about, incorrect line endings can wreck an ETL process. As shown in the hypothetical scenario above, in cases where the SSIS package expects to receive CRLF line endings but instead gets just an LF, most likely the package will fail due to either data truncation or data type issues. Even worse, if the package is set up to process LF line endings but receives a file with CRLF terminators, chances are good that the data will actually be loaded – with some extra baggage. In the latter case, if the last data field on the line is interpreted as a character data type (CHAR, NVARCHAR, etc.), the carriage return character would be preserved in the loaded data. In the example below, I’ll show how this can impact the quality of that data.

For this example, I’ve created an SSIS package to process a data file using LF line terminators. Then, I regenerate the same data file using CRLF line endings, and process the modified file. The package successfully loads the file, with no apparent problems. Below I can see in my target table that the data has been loaded.

image

Now, I want to find all products matching the first ItemName in the list. When I query this table using the exact ItemName value I just saw in my SSMS results window, I find that I get no results at all.

image

Even though I’m typing in the exact description I see, I get no results for that value. The reason is that I’m looking for the literal string ‘Mountain-100 Black, 42’, when in reality, the value in this field contains an unseen carriage return. Because the SSIS connection was configured to use LF as the line ending, it interprets the carriage return to be part of the data, and loads it to the output table. Copying that value from the SSMS results grid and pasting it into the query window confirms that the extra CR character is present at the end of the data value. Knowing this, I can modify the section criteria I used, changing the query from an exact match to a LIKE with a wildcard at the end to return the values I expected to see.

image

This confirms that the line ending is the problem, but what can be done to avoid this in the first place?

Fixing the Line Ending Problem

When coding to avoid issues with inconsistent line endings, there are three potential scenarios to plan for:

  • Lines with LF line terminators
  • Lines with CRLF line terminators
  • Lines with CR line terminators (a far less common scenario)

Planning for the first two scenarios listed above is relatively easy; the last one takes a bit of work. I’ll demonstrate the design patterns for handling each of these.

Coding for LF and CRLF

As I mentioned earlier, files originally specified as LF endings then getting switched to CRLF (or vice versa) is more common that you might think. However, this problem is fairly easy to resolve using the SSIS data flow. First, the flat file source should be updated to use only a line feed for the line terminator, as shown below.

image

Next, on the data flow, add a derived column transformation to the data pipeline. This transformation will remove any carriage return values (indicated by “\r” in the SSIS expression) found in the last data field.

image

When using this pattern, the output will be the same regardless of whether the lines in the data file are terminated with LF or CRLF. For the latter, the package will simply remove the extra carriage return in the data flow. This is a very easy pattern to implement, and will provide protection against line endings changing from LF to CRLF, or vice versa.

Coding for CR, LF, or CRLF

Building a package to handle any type of line ending – CR, LF, or CRLF – takes a bit more work. Since the SSIS flat file connection manager must be configured for the type of line ending to expect, preparing for line endings that are not known at design time requires a more versatile source: the script component. Using the System.IO namespace in the script component, I can open the file, read through each line, and parse out the values irrespective of the line endings used.

In this example, I’ve added a new script component to the data flow, and I have configured this as a source. Next, I added output columns to the default output on the script component, which match the metadata in the table to which we will be loading the data. Finally, I wrote the code below which will read each line in the file, assigning the values to their respective columns in the output.

The reason this works in the C# code above and not in the flat file source is that C# treats lines within files a bit differently than the SSIS connection manager does. The flat file connection manager in SSIS has to be configured for a specific type of line ending, while the StreamReader.ReadLine() function simply reads to the end of the line irrespective of the line ending used.

Again, it’s been rare that I’ve had to code for the possibility of three different line ending types. However, I have seen this occur a few times, and for volatile data files, it’s a design pattern worth considering.

Conclusion

In the same way data professionals are usually skeptical of untested data, they must also be mindful of suspect metadata. Changes that appear to be as benign as a modification to the line terminator characters can have a serious negative impact on ETL and its target systems. Using the defensive ETL strategies I’ve shown here, you can help prevent errors or data quality issues related to changing line endings in data files.

About the Author

Tim Mitchell
Tim Mitchell is a business intelligence and SSIS consultant who specializes in getting rid of data pain points. Need help with data warehousing, ETL, reporting, or SSIS training? Contact Tim here: TimMitchell.net/contact

27 Comments on "Fix Inconsistent Line Terminators in SSIS"

  1. Thanks for the great article and workaround. Would be so, so much more usable if there was an option to specify ‘Line endings same as C#’, or words to that effect.

  2. Tim, I like your transform option. It’s making me think of some things I could do with subpackages and BIML. In the past, I’ve dealt with it like you via C# but I simply clean the whole file: open it, replace the line terminators with a standard, and resave the file, then I can reuse it with whatever file source it is. But, you could use metadata and drive this via BIML. Great food for thought! Thanks!

    • Tim Mitchell | March 31, 2015 at 9:41 am | Reply

      Thanks for the kind words! Yes, I’ve used the same methodology in the past to clean up awkwardly formatted data files. Fortunately, improvements in the later versions of SSIS have made it easier to do this inline, without the extra step to massage the file.

  3. Nice article Tim and well-though-out resolutions.

    Phil

  4. I have the same issue and tried the first resolution given here. But struck at some point. I have the column with currency data type and not able to put in replace statement.. How can I handle it?

    • If the column is something other than a string data type, you’ll have to import it first as text, run the REPLACE operation, and then convert it to the desired type.

  5. Great approach Tim. We have a client moving from Windows to Linux and the CRLF to LF issue cropped up and I came across this post. Appreciate the time.

  6. I’m really having conversion issues ,what ever you told is not working properly for me regarding currency data type field.

  7. [Data Conversion [514]] Error: Data conversion failed while converting column ” ” (392) to column “Copy of ” (521). The conversion returned status value 2 and status text “The value could not be converted because of a potential loss of data.”.

  8. Thank you – this was most helpful

  9. Hi Tim,
    Can you please let us know how have you defined “vDelimiter” in the C# code?
    I am trying to implement the same code, but it doesn’t like vDelimiter
    Eagerly waiting for your reply

  10. Thomas Rones | March 13, 2018 at 5:57 am | Reply

    Is there a simple way to remove an unwanted LF that is found in what should just be a string? (I’m sure I could just script component with regex, but that seems like a bit much)
    note: DataFlow is SQL Server -> csv

  11. But what about flat files with a header row? If the data has a header row, it too will have a delimiter of {LF} or {CR}{LF}. And we can’t “skip x header rows” if we don’t know the delimiter upfront. Am I missing something?

  12. Your solution to switch between LF and CRLF using an expression is not working. Not sure what i am missing but the package fails at the first step while reading the data in SSIS Flat file source saying that the “Column delimiter is not found”. Can you provide your help ?

  13. Scott Herrmann | October 11, 2018 at 1:34 pm | Reply

    For a fixed-width file, we found that one extra change needed to be made in the data flow. Even after making the changes suggested on this page, the package was still failing due to a truncation error on the last column (the carriage return was interpreted as an extra character). To resolve this issue, open the Flat File Source Editor dialog box found in the data flow, click on the Error Output page, go to the last column, and change the action for a Truncation error to “Ignore failure”.

    Thanks for posting this solution, Tim. It’s helped me quite a few times.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: