Fix Inconsistent Line Terminators in SSIS

There is a flat file processing issue I’ve run into a number of times over the years, and it’s come up again several times recently. The issue relates to the line terminators used in data files. Occasionally, changes to the systems generating these data files, or perhaps even manual edits, can change the way the file marks the end of a line. These changes can cause a failure of package execution, or even worse, they can be loaded successfully and cause data quality issues in the target.

In every text file, there are unprintable characters called line terminators that mark the end of each line. On most UNIX and Linux systems and some older operating systems, files are created using the line feed character (LF, or ASCII character 10), while files generated in Windows typically use the carriage return and line feed (CRLF, or ASCII characters 13 and 10, respectively) to mark line endings. The tricky part is that these characters are generally not displayed, so opening up a data file with Notepad or a similar editor will not present any visual differences between line feed-terminated files and those using carriage return plus line feed. However, these differences can have a significant impact on ETL processes, as well as any downstream systems relying on data from those files.
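
If you ever need to confirm which terminator a file actually uses, one option is to look at the raw bytes rather than trust an editor. Below is a minimal C# sketch (the file path is just a placeholder) that reports whether the first line break it encounters is an LF, a CRLF, or a lone CR. For a very large file you would only need to read the first few kilobytes, but reading the whole file keeps the example short.

using System;
using System.IO;

class LineEndingCheck
{
    static void Main()
    {
        // Placeholder path; point this at the data file in question
        byte[] bytes = File.ReadAllBytes(@"C:\Data\SampleFile.txt");

        for (int i = 0; i < bytes.Length; i++)
        {
            if (bytes[i] == 10)        // LF (ASCII 10) with no preceding CR
            {
                Console.WriteLine("First line break is LF");
                return;
            }
            if (bytes[i] == 13)        // CR (ASCII 13); check whether an LF follows
            {
                bool lfFollows = i + 1 < bytes.Length && bytes[i + 1] == 10;
                Console.WriteLine(lfFollows ? "First line break is CRLF" : "First line break is CR only");
                return;
            }
        }

        Console.WriteLine("No line break found");
    }
}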

In this post, I’ll show some examples of both LF and CRLF terminated files, and how they behave in SSIS when the package is expecting one type but gets the other. I’ll then demonstrate two workarounds to prevent unexpected errors due to changing line endings.

The scenario

My client, Contoso, has a new supplier, and I’m setting up the ETL process to import some flat file data from this new supplier. I’m working from a file spec that includes column-level mappings and indicates that lines will be terminated using CRLF. I build the package, test it against sample data provided by Contoso’s new supplier, and send the successfully tested package to QA for final testing and deployment to production.

Weeks go by, and the package works great. However, one morning I get a call from Contoso, asking for advice in troubleshooting this new process. It appears that the package has failed without loading any data, logging a data truncation issue upon failure. Both Contoso and their new supplier have reviewed the data file causing the failure, and cannot find any reason for the error. I open the file up in Notepad++ and turn on the Show Line Endings feature, and the problem becomes readily apparent. The most recently successful file looks like this:

image

However, the failed file looks like this:

image

The difference is subtle but important: the failed file uses a lone line feed as its terminator, while the previously successful file uses a carriage return plus line feed. This distinction is not visible when using tools such as Notepad, and even in Notepad++ these characters aren’t shown by default. Invisible or not, though, the distinction is very important.

Why does it matter?

Although they are easy to forget about, incorrect line endings can wreck an ETL process. As shown in the hypothetical scenario above, in cases where the SSIS package expects to receive CRLF line endings but instead gets just an LF, most likely the package will fail due to either data truncation or data type issues. Even worse, if the package is set up to process LF line endings but receives a file with CRLF terminators, chances are good that the data will actually be loaded – with some extra baggage. In the latter case, if the last data field on the line is interpreted as a character data type (CHAR, NVARCHAR, etc.), the carriage return character would be preserved in the loaded data. In the example below, I’ll show how this can impact the quality of that data.

For this example, I’ve created an SSIS package to process a data file using LF line terminators. Then, I regenerate the same data file using CRLF line endings, and process the modified file. The package successfully loads the file, with no apparent problems. Below I can see in my target table that the data has been loaded.

image

Now, I want to find all products matching the first ItemName in the list. When I query this table using the exact ItemName value I just saw in my SSMS results window, I find that I get no results at all.

image

Even though I’m typing in the exact description I see, I get no results for that value. The reason is that I’m looking for the literal string ‘Mountain-100 Black, 42’, when in reality, the value in this field contains an unseen carriage return. Because the SSIS connection was configured to use LF as the line ending, it interprets the carriage return as part of the data, and loads it to the output table. Copying that value from the SSMS results grid and pasting it into the query window confirms that the extra CR character is present at the end of the data value. Knowing this, I can modify the selection criteria I used, changing the query from an exact match to a LIKE with a wildcard at the end to return the values I expected to see.

image

This confirms that the line ending is the problem, but what can be done to avoid this in the first place?

Fixing the Line Ending Problem

When coding to avoid issues with inconsistent line endings, there are three potential scenarios to plan for:

  • Lines with LF line terminators
  • Lines with CRLF line terminators
  • Lines with CR line terminators (a far less common scenario)

Planning for the first two scenarios listed above is relatively easy; the last one takes a bit of work. I’ll demonstrate the design patterns for handling each of these.

Coding for LF and CRLF

As I mentioned earlier, a file originally specified with LF line endings being switched to CRLF (or vice versa) is more common than you might think. However, this problem is fairly easy to resolve using the SSIS data flow. First, the flat file source should be updated to use only a line feed for the line terminator, as shown below.

image

Next, add a derived column transformation to the data flow. This transformation will remove any carriage return values (indicated by “\r” in the SSIS expression) found in the last data field.

image

When using this pattern, the output will be the same regardless of whether the lines in the data file are terminated with LF or CRLF. For the latter, the package will simply remove the extra carriage return in the data flow. This is a very easy pattern to implement, and will provide protection against line endings changing from LF to CRLF, or vice versa.
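
For those who prefer a script transformation over a derived column, the equivalent cleanup is a one-liner. The sketch below is just an alternative to the derived column shown above; it assumes the last column in the file is named ItemName, as in my example, and uses the method signature SSIS generates for a script component transformation with an input named Input 0.

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Strip the trailing carriage return left behind when a CRLF-terminated
    // file is read using an LF-only line terminator (ItemName is assumed to
    // be the last column in the file).
    if (!Row.ItemName_IsNull)
    {
        Row.ItemName = Row.ItemName.TrimEnd('\r');
    }
}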

Coding for CR, LF, or CRLF

Building a package to handle any type of line ending – CR, LF, or CRLF – takes a bit more work. Since the SSIS flat file connection manager must be configured for the type of line ending to expect, preparing for line endings that are not known at design time requires a more versatile source: the script component. Using the System.IO namespace in the script component, I can open the file, read through each line, and parse out the values irrespective of the line endings used.

In this example, I’ve added a new script component to the data flow, and I have configured this as a source. Next, I added output columns to the default output on the script component, which match the metadata in the table to which we will be loading the data. Finally, I wrote the code below which will read each line in the file, assigning the values to their respective columns in the output.

public override void CreateNewOutputRows()
{
    // Open the file referenced by the SalesFile connection manager,
    // disposing of the reader when we're done with it
    using (StreamReader reader = new StreamReader(Connections.SalesFile.ConnectionString))
    {
        // Skip header row
        reader.ReadLine();

        while (!reader.EndOfStream)
        {
            string line = reader.ReadLine();
            string[] columns = line.Split(Variables.vDelimiter.ToCharArray());

            // Use an increasing integer for indexing the columns below (so I can be lazy and paste below).
            int i = 0;

            // Add a new row to the output for each line in the file
            Output0Buffer.AddRow();

            // Assign the appropriate values into the output columns
            Output0Buffer.SalesOrderID = int.Parse(columns[i++]);
            Output0Buffer.SalesOrderDetailID = int.Parse(columns[i++]);
            Output0Buffer.CarrierTrackingNumber = columns[i++];
            Output0Buffer.OrderQty = sbyte.Parse(columns[i++]);
            Output0Buffer.ProductID = short.Parse(columns[i++]);
            Output0Buffer.SpecialOfferID = sbyte.Parse(columns[i++]);
            Output0Buffer.UnitPrice = float.Parse(columns[i++]);
            Output0Buffer.UnitPriceDiscount = float.Parse(columns[i++]);
            Output0Buffer.LineTotal = float.Parse(columns[i++]);
            Output0Buffer.rowguid = columns[i++];
            Output0Buffer.ModifiedDate = DateTime.Parse(columns[i++]);
            Output0Buffer.ItemName = columns[i++];
        }
    }
}

The reason this works in the C# code above and not in the flat file source is that C# treats lines within files a bit differently than the SSIS connection manager does. The flat file connection manager in SSIS has to be configured for a specific type of line ending, while the StreamReader.ReadLine() method treats a lone CR, a lone LF, or a CRLF pair as the end of a line, so it handles any of them without additional configuration.
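
To see that behavior in isolation, here is a small console sketch. It uses StringReader, which follows the same line-ending rules as StreamReader, to split a string containing all three terminator styles into four clean lines:

using System;
using System.IO;

class ReadLineDemo
{
    static void Main()
    {
        // One string containing LF, CRLF, and CR line endings
        string mixed = "line one\nline two\r\nline three\rline four";

        using (StringReader reader = new StringReader(mixed))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Each value prints without any stray CR or LF characters
                Console.WriteLine("[" + line + "]");
            }
        }
    }
}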

Again, it’s been rare that I’ve had to code for the possibility of three different line ending types. However, I have seen this occur a few times, and for volatile data files, it’s a design pattern worth considering.

Conclusion

In the same way data professionals are usually skeptical of untested data, they must also be mindful of suspect metadata. Changes that appear to be as benign as a modification to the line terminator characters can have a serious negative impact on ETL and its target systems. Using the defensive ETL strategies I’ve shown here, you can help prevent errors or data quality issues related to changing line endings in data files.

How to burn down your house while frying a turkey

It’s an odd query, yes, but in preparation to write this post I actually typed the above phrase into my browser.  No, I’m certainly not looking to burn down my house.  In fact, wait here while I clear my search history, just in case.

For the sake of argument, let’s say you’re planning to fry a turkey over the upcoming Thanksgiving holiday.  Think about the research you’d do: What type of equipment will I need? How much oil should I buy?  How big should the turkey be?  How long should I cook it? All valid queries that should be answered before taking on the task of dropping a frozen bird into boiling oil.  But are those the only questions you should ask?  Talk to anyone about the dangers of frying a turkey, even those who have never done it, and they’ll tell stories about a brother-in-law, or a coworker, or some guy on YouTube who set ablaze the family homestead in a misguided effort to cook Thanksgiving dinner.

Statistically, it may seem like a silly question to ask.  What are the odds that frying this turkey will set my house on fire?  All in all, probably pretty low.  But it does happen – and if it does, the consequences can be disastrous.  So, when taking on this task – especially for the first time – asking questions (What factors make it more likely that this turkey fry will turn into a huge bonfire?) that can help reduce the risk seems like a good investment.

Be a data pessimist

If you’ve met me in person, you probably remember me as a glass-half-full guy.  But when it comes to data management, I’m a full-on pessimist.  Any data I get is crap until proven otherwise.  Every data load process will fail at some point.  And, given enough time and iterations, even a simple data movement operation can take down an entire organization.  It’s the turkey burning down the house.  Yes, the odds of a single data process wreaking havoc on the organization are very, very low, but the impact if realized is very, very high.  High enough that it’s worth asking those questions.  What part of this process could wreck our financial reporting?  What factors make this more likely to happen?  How can we mitigate those factors?

For the record, I don’t suggest that we all wear tin-foil hats and prepare for space aliens to corrupt our data.  However, there are lots of unlikely-yet-realistic scenarios in almost any process.  Think about your most rock-solid data operation right now.  What potential edge cases could harm your data infrastructure?  Sometimes it’s the things that might seem harmless:

  • Is it possible that we could run two separate loads of the exact same data at the same time?
  • What if a process extracts data from a file that is still being written to (by a separate process)?
  • What if a well-meaning staff member loads a properly formatted but inaccurate data file to the source directory?

Others, while even less likely, could lead to a real mess:

  • Is it possible for my data load process to be purposefully executed with inaccurate or corrupt data?
  • Could some situation exist within my ETL process that would allow essential rows of data to simply be lost, silently and without errors?
  • Do I have any processes that could make unaudited changes to my financial records?

Each potential scenario would have to be evaluated to determine the cost to prevent the issue versus the likelihood of realization and the impact if realized.

Fortunately, most of the data problems we deal with are not as catastrophic as igniting one’s home with a fried turkey on Thanksgiving.  However, as data professionals, our first responsibility is to protect the data.  We must always pay attention to data risk to ensure that we don’t allow data issues to take the house down.

DQS Composite Domains and Value Combinations

As I’ve been working with Data Quality Services over the past few weeks, I’ve spent a lot of time working with data domains, composite domains, and rules.  In that exploration, I’ve found some behavior that might not be expected when performing cleansing operations against a knowledge base containing a composite domain.

In this post, I’ll outline the expected data cleansing behavior for composite domain value combinations, and will show how the actual result is not what one would expect in this case.  I’ll also briefly describe a couple of workarounds to address this issue.

Overview

Here’s the layout of the issue at hand.  Composite domains can be created in knowledge bases in DQS, and encompass two or more existing organic domains within the same knowledge base.  Those composite domains can then be leveraged in a data cleansing project; if you engage all of the included domains that are part of a composite, that composite domain will automatically be included as part of the cleansing operation.  From here, a reasonable person (and by “a reasonable person,” I mean me) could assume that if the composite domain is used as part of the cleansing operation, it would perform the cleansing against the combination of values in the composite domain rather than just the individual domains therein.  However, my experimentation has found otherwise.

Make sense? Don’t worry – if I lost you in the problem description, I think a simple example should bring it back into focus.

Example

I’ve been using automotive data for a lot of my recent DQS samples, so we’ll stick with that for now.  I’ve got a reference set of data with (mostly) valid automobile data that I’m using to build a DQS knowledge base through the knowledge discovery activity.  Included in the reference data are automobiles of various make and model, among them the Chevrolet Camaro and several flavors of Ford automobile (we’ll get back to these specifics in a second).  When I import this information through knowledge discovery, it renders both Ford and Chevrolet as valid automobile makes, and the Camaro is present as a valid model of automobile.

image

Now, I want to create an association between make and model, since model is mostly dependent on make.  I create a new composite domain in my knowledge base, and use the combination of Make and Model domains to build this new composite domain.

image

With that done, I’ll republish the knowledge base, and we’re good to go.  Next, I’ll create a DQS cleansing project that will leverage the knowledge base we’ve built with this automobile data.  I’m going to use a smaller and dirtier set of data to run through the cleansing process.  This data will also bring to light a counterexample of the expected behavior of the composite domain.

When I wire up the table containing the dirty data to the new cleansing project, I get the option of including the composite domain since I’m leveraging both of the elements of that composite domain against the data to be cleansed.  By clicking the View/Select Composite Domain button I can see that the Make and Model composite domain is used by default.

SNAGHTML748b8ba

Before I run the cleansing operation on this DQS project, let’s peek at the data we’ll be cleansing in this new project:

image

You’ll see that I called out a particular entry, and it’s probably clear why I referenced the Camaro earlier.  In our dirty data we have a Ford (valid make) Camaro (valid model), but there’s no such thing as a Ford Camaro in production or in our knowledge base.  When the make and model domains are individually verified, this record would be expected to go through the cleansing process with no changes at all.  However, because we’ve got a composite domain set up to group together the make and model, I expect this to fall out as a new entry (rather than a match to something existing in the knowledge base) since our KB does not have the Make and Model combination of Ford Camaro.

However, when I run the cleansing operation and review the results, what I find is not what I expected:

image

Under the Make and Model composite domain results (notice the individual Make and Model domains are not present, since we’ve engaged the composite domain), I find that the incorrect Ford Camaro entry is shown, but instead of appearing under the New tab, it surfaces in the Correct tab, indicating that the value is already present in the knowledge base.  Given that the displayed reason is a “Domain value” match, it appears that, despite the use of the composite domain, the individual domains are being used to align the cleansed data with the information in the knowledge base.

Workarounds?

Ideally, what we’d see is the Ford Camaro entry pushed to the New tab since there is no such combination in the KB.  However, there are a few limited options to work around this.

First, you could create a separate field containing the entire make and model combination in your source data, and perform the Make + Model validation against the single field.  This is probably the most realistic workaround as it doesn’t require a lot of static rules.  However, it still means that you will likely need to reengineer the way you stage the data.  It’s a generally accepted practice to store data elements as atomic units, and building a Make + Model field limits your options or forces you to undo that same operation later in the ETL.

You also have the option to create rules against your composite domains to set if/then scenarios for data validation.  For example, you could create a rule that dictates that if the car is a Camaro, the make must be Chevrolet.  However, unless the cardinality of your data is very, very low, don’t do this.  Creating static rules to deal with data semantics is like pulling at a loose thread on a sweater: you’ll never find the end of it, and it’ll just make a mess in the meantime.

Resolution

I’d like to see this behavior fixed, as I think it will lead to confusion and a lot of extra work on the part of data quality and ETL professionals.  I’ve created a Connect bug report to address this behavior, and I’m hopeful that we’ll see a change in this behavior in a future update or service pack.  Feel free to add your vote or comments to the Connect item if you think the change I describe would be useful.

Conclusion

In this post, I’ve highlighted the unexpected behavior of composite domains in data cleansing operations, along with a few workarounds to help you get past this issue.  As always, comments and alternative suggestions are welcome!

DQS Validation Rules on Composite Domains

In Data Quality Services, composite domains can be created to associate together two or more natural domains within a knowledge base.  Like natural domains, composite domains can also contain one or more validation rules to govern which domain values are valid.  In my last post, I discussed the use of validation rules against natural domains.  In this post, I’ll continue the thread by covering the essentials of composite domain rules and demonstrating how these can be used to create relationships between data domains.

What is a composite domain?

Before we break off into discussing the relationships between member domains of a composite domain, we’ll first touch on the essentials of the latter.

Simply put, a composite domain is a wrapper for two or more organic domains in a knowledge base.  Think of a composite domain as a virtual collection of dissimilar yet related properties.  As best I can tell, the composite domain is not materialized in the DQS knowledge base, but is simply a meta wrapper pointing to the underlying values.

To demonstrate, I’ve created a knowledge base using a list of automobile makes and models, along with a few other properties (car type and seating capacity).  I should be able to derive a loose association between automobile type and seating capacity, so I’ll create a composite domain with those two domains as shown below.

image

As shown above, creating a composite domain requires nothing more than selecting two or more domains from an existing knowledge base.  After the composite domain has been created, your configuration options are generally limited to attaching the composite domain to a reference data provider (which I’ll cover in a future post) and adding composite domain rules.

Value association via composite domain rules

The most straightforward way to associate the values of a composite domain is to create one or more rules against that composite domain.  Rules created against a composite domain can declare if/then scenarios that describe the allowable combinations of values therein.

Back in the day, before marriage, kids, and a mortgage, I used to drive sports cars.  Even though that was a long time ago, I do remember a few things about that type of automobile: they are fast, expensive to insure, and don’t have a lot of passenger capacity.  It’s on that last point that we’ll focus our data quality efforts for now.  I want to make sure that some sneaky manufacturer doesn’t falsely identify as a sports car some big and roomy 4-door sedan.  Therefore, I’m going to create a rule that will restrict the valid domain values for seating capacity for sports cars.

I’ll start with some business assumptions.  What’s the minimum number of seats a sports car should have?  I think it’s probably 2, but I suppose if some enterprising gearhead decided to convert an Indy Car into a street-legal machine, it would likely be classified as a sports car too.  Therefore, it would be reasonable to assume that, in an edge case, a sports car could have just a single seat, so our minimum seating capacity for a sports car would be 1.  On the high side, the design of sports cars should dictate that there aren’t many seats.  For example, the Chevrolet Camaro I had in high school could seat 4 people, assuming that 2 of the people were small children with stunted growth who had no claustrophobic tendencies.  However, we can give a little on this issue and assume that someone somehow manages to squeeze a third row of seats into a Dodge Magnum, so we’ll say that a sports car can have a maximum seating capacity of 6 people.

Now, with that information in hand, I’m going to use the Domain Management component of the DQS client to set up the new rule against the “Type and Capacity” composite domain from above.  As shown below, I can set value-specific constraints on the seating capacity based on the automobile type of Sports Car.

image

As shown, any valid record with a car type of Sports Car must have a seating capacity of between 1 and 6 persons.

Of course, sports cars aren’t the only types of automobiles (gasp!), so this approach would likely involve multiple rules.  Fortunately, composite domains allow for many such rules, which would permit the creation of additional restrictions for other automobile types.  You could also expand the Sports Car rule and add more values on the left side of the operator (the if side of the equation).  For example, you might call this instead a “Small Car rule” and include both sports cars and compact cars in this seating capacity restriction.

Other uses

Although we’ve limited our exploration to simply interrogating the value of the natural domains within a composite domain, this is by no means our only option for validation.  For example, when dealing with string data you can inspect the length of the string, search for patterns, use regular expressions, and test for an empty string in addition to checking against the actual value.  Shown below are some of the options you can use to query against a string value in a domain rule.

image

When dealing with date or numerical data, you have the expected comparison operators including less than, greater than, less than or equal to, etc.

Conclusion

This post has briefly explored composite domains and shown how to add validation rules to a composite domain in an existing knowledge base.  In my next DQS post, I’ll continue with composite domains to illustrate a potential misunderstanding in the way composite domains treat value combinations in cleansing operations.

DQS Domain Validation Rules

A compelling feature of the new Data Quality Services in SQL Server 2012 is the ability to apply rules to fields (domains) to describe what makes up a valid value.  In this brief post, I’d like to walk through the concepts of domain validation and demonstrate how this can be implemented in DQS.

Domain validation essentials

Let’s ponder domain validation by way of a concrete example.  Consider the concept of age: it’s typically expressed in discrete, non-negative whole numbers.  However, the expected values of the ages of things will vary greatly depending on the context.  An age of 10 years seems reasonable for a building, but sounds ridiculous when describing fossilized remains.  A date of “1/1/1950” is a valid date and would be appropriate for classifying a person’s date of birth, but would be out of context if describing when a server was last restarted.  In a nutshell, the purpose of domain validation is to allow context-specific rules to provide reasonableness checks on the data.

A typical first step in data validation would involve answering the following questions:

  • Is the data of the right type?  This helps us to eliminate values such as the number “purple” and the date “3.14159”.
  • Does the data have the right precision? This is similar to the point above: If I’m expecting to store the cost of goods at a retail store, I’m probably not going to configure the downstream elements to store a value of $100 million for a single item.
  • Is the data present where required?  When expressing address data, the first line of an address might be required while a second line could be optional.

Domain validation goes one step further by answering the question, “Is a given value valid when used in this context?”  It takes otherwise valid data and validates it to be sure it fits the scenario in play.
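
To put that in concrete terms (purely as an illustration; DQS itself doesn’t require any code for this), the difference between a type check and a domain check can be expressed in a few lines of C#, using the door-count scenario I’ll build out below:

using System;

class DomainCheckExample
{
    static void Main()
    {
        string input = "12";

        // Type check: is the value an integer at all?
        int doors;
        bool isInteger = int.TryParse(input, out doors);

        // Domain check: is the value reasonable in this context
        // (the number of doors on an automobile, 1 through 9)?
        bool isReasonable = isInteger && doors >= 1 && doors <= 9;

        Console.WriteLine("Valid type: {0}, valid for this domain: {1}", isInteger, isReasonable);
    }
}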

Domain validation in DQS

Even if you don’t use this term to describe it, you’re probably already doing some sort of domain validation as part of your ETL or data maintenance routines.  Every well-designed ETL system has some measure of sanity checking to make sure data fits semantically as well as technically.

The downside to many of these domain validation scenarios is that they can be inconsistent and are usually decentralized.  Perhaps they are implemented at the outer layer of the ETL before data is passed downstream.  Maybe the rules are applied in stored procedures after the data is loaded, or even as (yikes!) triggers on the destination tables.

Data Quality Services seeks to remedy the inconsistency and decentralization issue, as well as make the process easier, by way of domain validation rules.  When creating a domain in DQS, you are presented with the option of creating domain rules that govern what constitutes a valid value for that domain.  For the example below, I’m using data for automobile makes and models, and am implementing a domain rule to constrain the value for the number of doors for a given model.

SNAGHTML612bbb0

With the rule created, I can apply one or more conditions to each of the rules.  As shown, I am going to constrain the valid values to lie between 1 and 9 inclusive, which should account for the smallest and largest automobile types (such as limousines and buses).

SNAGHTML6292bf0

For this rule, I’m setting the conditions that the value must be greater than zero and less than ten.  Note that there is no requirement to use this bookend qualification process; you can specify a single qualifier (for example, greater than zero) or have multiple conditions strung together in the same rule.  You can even change the AND qualifier to an OR if the rule should be satisfied when either condition is true – though I would caution you when mixing 3 or more conditions using both AND and OR, as the behavior may not yield what you might expect.

That’s all there is to creating a simple domain validation rule.  Remember that for the condition qualifiers, you can set greater than, less than, greater than/equal to, etc., for the inclusion rule when dealing with numerical or date domain data types.  For string data types, the number of options is even greater, as shown below:

image

Of particular interest here is that you can leverage regular expressions and patterns to look for partial or pattern matches within the string field.  You can also check the string value to see if it can be converted to numeric or date/time.

The rule in action

With the new domain validation rule in place, let’s run some test data through it.  I’m going to create a few test records, some of which violate the rule we just created, and run them through a new DQS project using the knowledge base we modified with this rule.

I’ll start off with the dirty data as shown below.  You can probably infer that we’ve got a few rows that do not comply with the rule we created, on both ends of the value scale:

image

After creating a new data cleansing project, I use the data shown above to test the rule constraining the number of doors.  As shown below in the New output tab, we have several rows that comply with this new rule:

SNAGHTML65f97ec

In addition, there are two distinct values found that do not meet the criteria specified in the new rule.  Selecting the Invalid tab, I see the values 0 and 12 have failed validation, as they fall outside the range specified by the rule.  In the Reason column, you can see that we get feedback indicating that our new rule is the reason that these records are marked as Invalid:

SNAGHTML661229e

So by implementing this rule against my data, I am able to validate not only that the value is present and of the correct type, but that it is reasonable for this scenario.

Conclusion

In this post we’ve reviewed the essentials of domain validation and how we can implement these checks through domain rules in SQL Server Data Quality Services.  In my next post, I’ll continue the discussion around domain rules by reviewing how these rules can be applied to composite domains in DQS.

Data Quality: The Discovery

I wrote a post a few months back about a healthcare data conversion project that I’ve been working on for the better part of 2 years.  My task on this project is to convert data from an old UNIX-based Universe database to a SQL Server-based application; the database we are extracting from is quite old, both in terms of technology as well as the length of time this application has been in use at this facility.  Universe, an IBM product, is a multivalue database, which in my understanding is like the anti-RDBMS, allowing for multiple values within each row/column intersection.  Getting data stored in this manner extracted and imported to SQL Server has been quite challenging, but that pales in comparison with the difficulty we have had with the quality of the data itself.

We discovered early in the conversion spec process that we would need to make some significant changes to the data to allow it to fit the data model expected by our new software vendor.  For example, the current application allows registration staff to enter a patient’s employer information by either direct text entry or through a lookup of existing employers in a lookup table.  This was not a problem in itself, as we could reference the lookup table during the conversion to retrieve the employer information.  However, a “feature” of the application is that a person can delete records from the lookup table, and since the patient’s visit information stored only the employer code, that information became orphaned; Bob’s Lumber Yard might be BLMYD or similar gibberish, and unless a tenured employee knew the code by heart, that data was lost to the bit bucket.

Another oddity we found was that the application allowed for the deletion of general ledger accounts, even if those accounts had transactions attached to them.  Now I’m not an accounting guy (at least I wasn’t before this project – I’ve since gotten a crash course) but even I can tell you that orphaning those transactions can throw off your bottom line, and the effects could be felt for many quarters to come.  Even though the transactions in question were several years old and I was able to address them relatively easily in the conversion, the overall quality of the data briefly came into question simply due to this little oddity.

Not all of the data quality issues we found were related to shortcomings of the system.  The most dangerous threat to data quality – user input – remains the most unpredictable wildcard in the entire process.  Staff turnover, training, and administrative policy changes through the years all contributed to inconsistencies in the data, causing us to create complex and often case-specific algorithms to correct them.

Through the course of this conversion, we found a number of other similar issues that caused us grief.  However, this old application has been a workhorse and has served its purpose well despite being very difficult to learn and expensive to maintain.  For years, users have been able to use the system effectively on a day-to-day basis, and there is very little downtime.  Reporting, on the other hand, has been the key shortcoming of this dinosaur.  These data quality issues, along with the lack of an intuitive reporting tool, have made reporting from this system all but impossible.  This organization has had to spend a painful amount of staff time creating reports because of the manual work required to do so.  More often than not, the reporting process goes like this:

  • The user generates a text-based report using one of two integrated reporting tools (neither of which is user-friendly or intuitive).
  • Said user runs a copy/paste operation from the text report into Excel.
  • Extraneous text such as page headers, etc., must be manually removed from the Excel file.
  • The user creates extra rows or columns and copies/pastes the data from one cell to another to properly align the data as required for reporting.
  • The user then performs all of the summarizing/sorting on the Excel document.

This is done for most of the critical reports each month.  Some reports must be generated weekly, which of course further increases staff workload.  The most discouraging thing, though, was that many users simply accepted this and didn’t know how much easier reporting could – and should – be.

In this first of what will hopefully be many posts about data quality, I’ve painted sort of a bleak picture – I generally start out on a more positive note, so please bear with me for now.  In future installments I’ll break down some of the particulars I ran into and some common (and a few uncommon) methods that can be used to improve the quality of data.