When processing data files using SQL Server Integration Services, it is not uncommon to receive files whose end-of-line markers differ from what the process expects, or even vary from one delivery to the next. In this post, I will demonstrate how to fix inconsistent line terminators in SSIS to avoid ETL errors.
Fix Inconsistent Line Terminators in SSIS
In every text file, there are unprintable characters called line terminators that mark the end of each line. On most UNIX and Linux systems and some older operating systems, files are created using the line feed character (LF, or ASCII character 10), while files generated in Windows typically use the carriage return and line feed (CRLF, or ASCII characters 13 and 10, respectively) to mark line endings. The tricky part is that these are nonprintable characters, so opening up a data file with Notepad or a similar editor will not present any visual differences between line feed-terminated files and those using carriage return plus line feed. However, these differences can significantly impact the execution of ETL processes and any downstream systems relying on data from those files.
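To make the invisible visible, here is a short Python sketch (illustrative only; the files in this post are processed with SSIS, and the sample text is invented) that shows the raw bytes behind each style of line ending:

```python
# Write the same two-line record with each style of line terminator,
# then inspect the raw bytes to see which terminator characters are present.
unix_style = "line one\nline two\n"          # LF only (ASCII 10)
windows_style = "line one\r\nline two\r\n"   # CR + LF (ASCII 13, then 10)

print([b for b in unix_style.encode("ascii") if b in (10, 13)])     # [10, 10]
print([b for b in windows_style.encode("ascii") if b in (10, 13)])  # [13, 10, 13, 10]
```

Both strings look identical when printed, which is exactly why a plain text editor gives no hint of the difference.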
To demonstrate the problem, I’ll show some examples of both LF and CRLF terminated files, and how they behave in SSIS when the package is expecting one type but gets the other. I’ll then demonstrate two workarounds to prevent unexpected errors due to changing line endings.
The scenario
My imaginary client, Contoso, has a new supplier, and I’m setting up the ETL process to import flat file data from this new supplier. I’m working from a file spec that includes column-level mappings and indicates that lines will be terminated using CRLF. I build the package, test it against sample data provided by Contoso’s new supplier, and send the successfully tested package to QA for final testing and deployment to production.
Weeks go by, and the package works great. However, one morning I get a call from Contoso, asking for advice in troubleshooting this new process. It appears that the package has failed without loading any data, logging a data truncation issue upon failure. Both Contoso and their new supplier have reviewed the data file causing the failure, and cannot find any reason for the error. I open the file up in Notepad++ and turn on the Show Line Endings feature, and the problem becomes obvious. The most recently successful file looks like this:
However, the failed file looks like this:
The difference is subtle but important: the second file uses the line feed character alone as its terminator, while the previous file uses a carriage return plus a line feed. The distinction is invisible in tools such as Notepad, and even Notepad++ hides these characters by default, but it has a very real impact on how the file is processed.
Why does it matter?
Although they are easy to forget about, incorrect line endings can wreck an ETL process. As shown in the hypothetical scenario above, in cases where the SSIS package expects to receive CRLF line endings but instead gets just an LF, most likely the package will fail due to either data truncation or data type issues. Even worse, if the package is set up to process LF line endings but receives a file with CRLF terminators, chances are good that the data will actually be loaded – with some extra baggage. In the latter case, if the last data field on the line is interpreted as a character data type (CHAR, NVARCHAR, etc.), the carriage return character would be preserved in the loaded data. In the example below, I’ll show how this can impact the quality of that data.
For this example, I’ve created an SSIS package to process a data file using LF line terminators. Then, I regenerate the same data file using CRLF line endings, and process the modified file. The package successfully loads the file, with no apparent problems. Below I can see in my target table that the data has been loaded.
Now, I want to find all products matching the first ItemName in the list. When I query this table using the exact ItemName value I just saw in my SSMS results window, I find that I get no results at all.
Even though I’m typing in the exact description I see, I get no results for that value. The reason is that I’m searching for the literal string ‘Mountain-100 Black, 42’, when in reality the value in this field contains an unseen carriage return. Because the SSIS connection was configured to use LF as the line ending, it interpreted the carriage return as part of the data and loaded it into the output table. Copying that value from the SSMS results grid and pasting it into the query window confirms that the extra CR character is present at the end of the data value. Knowing this, I can modify the selection criteria, changing the query from an exact match to a LIKE with a wildcard at the end, and the values I expected to see are returned.
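The effect is easy to reproduce outside SSIS. This Python sketch (a conceptual stand-in for the SSIS pipeline; the pipe delimiter and ItemName value are taken from the example above) shows why the exact-match lookup fails:

```python
# A CRLF-terminated line, read by a source configured for LF alone,
# arrives downstream with a stray \r attached to the last field.
raw_line = "123|Mountain-100 Black, 42\r"
item_id, item_name = raw_line.split("|")

print(item_name == "Mountain-100 Black, 42")               # False: trailing CR
print(item_name.startswith("Mountain-100 Black, 42"))      # True: the LIKE-with-wildcard analogue
print(item_name.rstrip("\r") == "Mountain-100 Black, 42")  # True once the CR is removed
```

The `startswith` test is the Python analogue of the LIKE query with a trailing wildcard: it matches despite the hidden character, while the exact comparison does not.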
This confirms that the line ending is the problem, but what can be done to avoid this in the first place?
Fixing the Line Ending Problem
When coding to avoid issues with inconsistent line endings, there are three potential scenarios to plan for:
- Lines with LF line terminators
- Lines with CRLF line terminators
- Lines with CR line terminators (a far less common scenario)
Planning for the first two scenarios listed above is relatively easy; the last one takes a bit of work. I’ll demonstrate the design patterns for handling each of these.
Coding for LF and CRLF
As I mentioned earlier, a file originally specified with LF endings later switching to CRLF (or vice versa) is more common than you might think. Fortunately, this problem is fairly easy to resolve in the SSIS data flow. First, update the flat file source to use only a line feed for the line terminator, as shown below.
Next, on the data flow, add a derived column transformation to the data pipeline. This transformation will remove any carriage return values (indicated by “\r” in the SSIS expression) found in the last data field.
When using this pattern, the output will be the same regardless of whether the lines in the data file are terminated with LF or CRLF. For the latter, the package will simply remove the extra carriage return in the data flow. This is a very easy pattern to implement, and will provide protection against line endings changing from LF to CRLF, or vice versa.
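In Python terms (a conceptual sketch of this pattern, not the SSIS implementation itself; the pipe-delimited sample data is invented), the derived column step amounts to splitting on LF and then stripping any carriage return from the last field:

```python
def parse_lines(text):
    """Split on LF, then remove any carriage return left on the last field,
    mirroring the derived column REPLACE of "\\r" with an empty string."""
    rows = []
    for line in text.split("\n"):
        if not line:          # skip the empty remainder after the final terminator
            continue
        fields = line.split("|")
        fields[-1] = fields[-1].replace("\r", "")
        rows.append(fields)
    return rows

lf_file = "1|Widget\n2|Gadget\n"
crlf_file = "1|Widget\r\n2|Gadget\r\n"

# Identical output regardless of which terminator the file uses:
print(parse_lines(lf_file) == parse_lines(crlf_file))  # True
```

Because the LF-only file contains no carriage returns, the REPLACE step is a harmless no-op for it, which is what makes the pattern safe for both terminator styles.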
Coding for CR, LF, or CRLF
Building a package to handle any type of line ending – CR, LF, or CRLF – takes a bit more work. Since the SSIS flat file connection manager must be configured for the type of line ending to expect, preparing for line endings that are not known at design time requires a more versatile source: the script component. Using the System.IO namespace in the script component, I can open the file, read through each line, and parse out the values irrespective of the line endings used.
In this example, I’ve added a new script component to the data flow, and I have configured this as a source. Next, I added output columns to the default output on the script component, which match the metadata in the table to which we will be loading the data. Finally, I wrote the code below which will read each line in the file, assigning the values to their respective columns in the output.
The reason this works in the C# code above and not in the flat file source is that C# treats lines within files a bit differently than the SSIS connection manager does. The flat file connection manager in SSIS has to be configured for a specific type of line ending, while the StreamReader.ReadLine() function simply reads to the end of the line irrespective of the line ending used.
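Python's io layer behaves the same way as StreamReader.ReadLine() here: with universal newline handling enabled, a reader consumes CR, LF, or CRLF without being told in advance which to expect. A minimal sketch of that behavior (the pipe-delimited sample data is invented for illustration):

```python
import io

# Three copies of the same data, each using a different line terminator.
for raw in (b"a|1\nb|2\n", b"a|1\r\nb|2\r\n", b"a|1\rb|2\r"):
    # newline=None turns on universal newline translation, so each of
    # CR, LF, and CRLF is recognized as the end of a line.
    reader = io.TextIOWrapper(io.BytesIO(raw), newline=None)
    print([line.rstrip("\n").split("|") for line in reader])
```

All three iterations print the same parsed rows, which is exactly the behavior the script component source gives an SSIS package: the terminator no longer has to be declared at design time.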
Again, it’s been rare that I’ve had to code for the possibility of three different line ending types. However, I have seen this occur a few times, and for volatile data files, it’s a design pattern worth considering.
Conclusion
In the same way data professionals are usually skeptical of untested data, they must also be mindful of suspect metadata. Changes that appear to be as benign as a modification to the line terminator characters can have a serious negative impact on ETL and its target systems. Using the defensive ETL strategies I’ve shown here, you can help prevent errors or data quality issues related to changing line endings in data files.
Thanks for the great article and workaround. Would be so, so much more usable if there was an option to specify ‘Line endings same as C#’, or words to that effect.
Tim, I like your transform option. It’s making me think of some things I could do with subpackages and BIML. In the past, I’ve dealt with it like you via C# but I simply clean the whole file: open it, replace the line terminators with a standard, and resave the file, then I can reuse it with whatever file source it is. But, you could use metadata and drive this via BIML. Great food for thought! Thanks!
Thanks for the kind words! Yes, I’ve used the same methodology in the past to clean up awkwardly formatted data files. Fortunately, improvements in the later versions of SSIS have made it easier to do this inline, without the extra step to massage the file.
Nice article Tim and well-though-out resolutions.
Phil
Oops! I meant ‘thought’.
Thanks Phil!
I have the same issue and tried the first resolution given here, but got stuck at one point. I have a column with a currency data type and am not able to put it in the REPLACE statement. How can I handle it?
If the column is something other than a string data type, you’ll have to import it first as text, run the REPLACE operation, and then convert it to the desired type.
Great approach Tim. We have a client moving from Windows to Linux and the CRLF to LF issue cropped up and I came across this post. Appreciate the time.
Thanks Dave! Glad this post helped.
I’m really having conversion issues; whatever you suggested is not working properly for me on the currency data type field.
[Data Conversion [514]] Error: Data conversion failed while converting column ” ” (392) to column “Copy of ” (521). The conversion returned status value 2 and status text “The value could not be converted because of a potential loss of data.”.
The error message indicates that this is a data conversion issue. Check your data types to make sure you are using like-to-like data types in your mappings.
Thank you – this was most helpful
Thanks Karen! Glad it helped.
Hi Tim,
Can you please let us know how have you defined “vDelimiter” in the C# code?
I am trying to implement the same code, but it doesn’t like vDelimiter
Eagerly waiting for your reply
Reddy, the vDelimiter is an SSIS variable, defined in the package and referenced in the C# script task. After reading your comment, I realized I had not called out in the code the fact that I’m using an SSIS variable in this case. I’ll update the code comment to indicate this.
Hi Tim
Can you please update the code and help me out regarding “vDelimiter”
Thank you
Reddy
Reddy, the code has already been updated with the comment referencing the use of an SSIS variable. You can learn more about the use of SSIS variables in a script component here: https://social.technet.microsoft.com/wiki/contents/articles/22198.use-ssis-variables-and-parameters-in-a-script-component.aspx
Hi Tim,
I really appreciate the help. One last question for you 🙂
How are you declaring the “vDelimiter” in the SSIS package?
I mean what is the value you are declaring for that value?
this would be my last question 🙂
As I mentioned earlier, the vDelimiter variable is a variable declared in the SSIS package itself, and is referenced in the C# script. Read through the document I linked in the prior comment, and you’ll find how that is being used. If you want some basic information about SSIS variables in general, see this article: https://www.simple-talk.com/sql/ssis/ssis-basics-introducing-variables/
I declared “vDelimiter” as |
when I run my package I am getting “Index was outside the bounds of the array”
Is there a simple way to remove an unwanted LF that is found in what should just be a string? (I’m sure I could just script component with regex, but that seems like a bit much)
note: DataFlow is SQL Server -> csv
If the CR or LF characters are actually in the data, it becomes much more complicated. You’d most likely have to write a script to properly manage that.
But what about flat files with a header row? If the data has a header row, it too will have a delimiter of {LF} or {CR}{LF}. And we can’t “skip x header rows” if we don’t know the delimiter upfront. Am I missing something?
Your solution to switch between LF and CRLF using an expression is not working. Not sure what I am missing, but the package fails at the first step, while reading the data in the SSIS flat file source, saying that the “Column delimiter is not found”. Can you provide your help?
For a fixed-width file, we found that one extra change needed to be made in the data flow. Even after making the changes suggested on this page, the package was still failing due to a truncation error on the last column (the carriage return was interpreted as an extra character). To resolve this issue, open the Flat File Source Editor dialog box found in the data flow, click on the Error Output page, go to the last column, and change the action for a Truncation error to “Ignore failure”.
Thanks for posting this solution, Tim. It’s helped me quite a few times.
Thank you so much for this!
Tim, if you are still reading this support article, I would like to say a big thank you. We faced a very similar situation and solved it in two minutes thanks to your article.
Best
Aravind
CEO, BlueCrimson
Aravind, thanks for the feedback! I’m glad it was helpful to you.
I have used this method in an SSIS package and I am currently finding that it assumes the value of the field is blank now rather than NULL which is what I am expecting the result to be. Any suggestions on how to ensure the final value is treated as a NULL instead of blank?
Hi Tim – I have a text file with inconsistent data, so I need to normalize it and load it to a SQL table. Could you please help me with this? A sample of the format is given below:
FirstName LastName Title
Nancy Davolio Sales Representative
City
Seattle
StateProvince
WA
————————————————————
FirstName LastName Title
Smith Steve Manager
City
Seattle
StateProvince
WA
————————————————————
Thanks,
Ayyappan