Data Quality: The Discovery

I wrote a post a few months back about a healthcare data conversion project that I’ve been working on for the better part of 2 years.  My task on this project is to convert data from an old UNIX-based Universe database to a SQL Server-based application; the database we are extracting from is quite old, both in terms of technology as well as the length of time this application has been in use at this facility.  Universe, and IBM product, is a multivalue database, which in my understanding is like the anti-RDBMS, allowing for multiple values within each row/column intersection.  Getting data stored in this manner extracted and imported to SQL Server has been the quite challenging, but it pales in comparison with the difficulty we have had with the quality of the data itself.

We discovered early in the conversion spec process that we would need to make some significant changes to the data to allow it to fit the data model expected by our new software vendor.  For example, the current application allows registration staff to enter a patient’s employer information by either direct text entry or through a lookup of existing employers in a lookup table.  This was not a problem in itself, as we could reference the lookup table during the conversion to retrieve the employer information.  However, a “feature” of the application is that a person can delete records from the lookup table, and since the patient’s visit information stored only the employer code, that information became orphaned; Bob’s Lumber Yard might be BLMYD or similar jibberish, and unless a tenured employee knew the code by heart, that data was lost to the bit bucket.

Another oddity we found was that the application allowed for the deletion of general ledger accounts, even if those accounts had transactions attached to them.  Now I’m not an accounting guy (at least I wasn’t before this project – I’ve since gotten a crash course) but even I can tell you that orphaning those transactions can throw off your bottom line, and the effects could be felt for many quarters to come.  Even though the transactions in question were several years old and I was able to address them relatively easily in the conversion, the overall quality of the data briefly came into question simply due to this little oddity.

Not all of the data quality issues we found were related to shortcomings of the system.  The most dangerous threat to data quality – user input – remains the most unpredictable wildcard in the entire process.  Staff turnover, training, and administrative policy changes through the years all contributed to inconsistencies in the data, causing us to create complex and often case-specific algorithms to correct.

Through the course of this conversion, we found a number of other similar issues that caused us grief.  However, this old application has been a workhorse and has served its purpose well despite being very difficult to learn and expensive to maintain.  For years, users have been able to use the system effectively on a day-to-day basis, and there is very little downtime.  Reporting, on the other hand, has been the key shortcoming of this dinosaur.  These data quality issues, along with the lack of an intuitive reporting tool, have made reporting from this system all but impossible.  This organization has had to spend a painful amount of staff time creating reports because of the manual work required to do so.  More often than not, the reporting process goes like this:

  • The user generates a text-based report using one of two integrated reporting tools (neither of which is user-friendly or intuitive).
  • Said user runs a copy/paste operation from the text report into Excel.
  • Extraneous text such as page headers, etc., must be manually removed from the Excel file.
  • The user creates extra rows or columns and copies/pastes the data from one cell to another to properly align the data as required for reporting.
  • The user then performs all of the summarizing/sorting on the Excel document.

This is done for most of the critical reports each month.  Some reports must be generated weekly, which of course further increases staff workload.  The most discouraging thing, though, was that many users simply accepted this and didn’t know how much easier reporting could – and should – be.

In this first of what will hopefully be many posts about data quality, I’ve painted sort of a bleak picture – I generally start out on a more positive note, so please bear with me for now.  In future installments I’ll break down some of the particulars I ran into and some common (and a few uncommon) methods that can be used to improve the quality of data.