Those who follow my blog know that I write a lot about data quality. Measuring and improving the quality of data is an important part of any data initiative, especially in the data warehousing space. While data quality does get its share of attention, there is a concept that is equally important but is sadly overlooked during most data projects:
When I speak of metadata hygiene, I’m referring to the fidelity and consistency in which data structures (tables, views, stored procedures, SSIS packages, etc.) identify and process the underlying data. Every data professional has run into at least a few of the following cases of poor metadata hygiene:
- The exact same data stored in two different tables on the same database server to support legacy applications
- Outdated stored procedures that are no longer used
- Highly specific views that could be made generic
- Generically named SSIS packages (I’ve actually seen a few of these simply named “Load Data”)
- Duplicate load processes that should be modularized
- Mix-and-match naming conventions
- Inconsistent uses of schemas and other grouping structures
The above listed cases (and many others in the same realm) have fairly benign beginnings: someone created a one-time load process that was promoted to a scheduled job; a new dev team took over a project; someone simply didn’t know there was already a data structure for that and created a new one. These are functions of different coding styles and time, and they happen in every shop – even those with just a single developer.
Some of these can be fixed in a minute. Others might take an afternoon. Still others would be a major undertaking. I am reminded of a project I worked on several years ago, where the original developer had misspelled a word in the name of a column in a data warehouse table. By the time the misspelling was discovered, most of the load processes for that table had already been built – using the incorrect spelling. Those of us who are insistent on proper spelling even in object names had to adapt and force ourselves to misspell this column name over and over again when interacting with the data warehouse or reports. Yes, the issue could have been fixed in seconds, but when the misspelling was simply accepted and later integrated into load processes, the poor metadata hygiene was permanently stamped into the foundation.
A high degree of data quality and usability can be had even without good metadata hygiene. However, when proper care is given to making sure the metadata accurately and concisely represents the underlying data, the end product becomes much easier to use and understand.
This post was originally published in my Data Geek Newsletter.