As data professionals, there are times when our jobs are relatively easy. Back up the databases. Create the dashboard report. Move the data from flat files to the database. Create documentation. There are lots of cogs in those machines, but an experienced technologist will have little trouble ticking off those boxes. However, those whom we support – clients, end users, executives, coworkers – generally don’t care whether we’ve worked through our technical to-do list. Those folks want exactly one thing from us: data that they can trust. And building that trust is a very hard thing to do – much more difficult than any technical task in front of us.
We Don’t Trust This Data
In my time in consulting – and even before that, when I was a corporate employee – I have heard this phrase all too many times: “We don’t trust this data.”
The lack of trust in data is a cancer within an organization. Folks are rarely shy about sharing when they don’t trust the data on which an organization bases its critical decisions. Once the seed of distrust has been planted, it rarely goes away on its own. Further, it tends to spread to other arms of the company, even to those who have no direct reason for distrust.
The fruits of data distrust are plentiful, and rarely positive:
- Extended hesitation to make decisions based on suspicion of the data
- Business folks taking matters into their own hands and creating data silos (hello, Excel Hell!)
- Reports, cubes, and other structures falling into disuse (and eventually, a lack of further development or support)
- Executives and subordinates reverting to manually compiled data for decision-making
It goes without saying that distrust in an organization’s data is very bad. But why does it happen in the first place? And more importantly, what can we technical professionals do to prevent or remedy the situation? The short answer is that there is no short answer. However, to build a plan to reverse (or simply prevent) a pattern of distrust, we must first examine the reasons why trust might have been lost in the first place.
Garbage in, garbage out.
This one is likely the most difficult for data professionals to deal with. When you start with bad data, you rarely end up with perfect data; the best you can do is to end up with a not-as-bad set of data. There are some really great data quality and data cleansing tools on the market, but even the best of breed may not eliminate all of the data issues. I ran into this quite frequently during my healthcare days. The hospital I worked for exchanged data with dozens if not hundreds of vendors, each one with their own standards and practices (many of them involving manual data entry). Needless to say, data quality was a constant challenge, and we had to balance a need for having data as clean as possible with the amount of time it would take to build the logic to cleanse the data. When dealing with sets of data like this, it’s critical to set expectations (more on that shortly).
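To make that trade-off concrete, here is a minimal sketch (the record fields and rules are hypothetical, not from any particular vendor feed) of cheap cleansing logic that fixes what it safely can and flags what it can’t, rather than guessing:

```python
import re

def clean_vendor_record(record):
    """Apply minimal, low-cost cleansing rules to a raw vendor record.

    Illustrative only: trims whitespace, normalizes one date format,
    and flags (rather than silently "fixes") values it cannot repair.
    """
    cleaned = dict(record)
    issues = []

    # Rule 1: trim stray whitespace from every string field.
    for key, value in cleaned.items():
        if isinstance(value, str):
            cleaned[key] = value.strip()

    # Rule 2: normalize dates like "03/07/2024" to ISO "2024-03-07".
    date = cleaned.get("service_date", "")
    match = re.fullmatch(r"(\d{2})/(\d{2})/(\d{4})", date)
    if match:
        mm, dd, yyyy = match.groups()
        cleaned["service_date"] = f"{yyyy}-{mm}-{dd}"
    elif not re.fullmatch(r"\d{4}-\d{2}-\d{2}", date):
        # Rule 3: don't guess at unrepairable values -- surface them.
        issues.append("service_date: unrecognized format")

    return cleaned, issues

cleaned, issues = clean_vendor_record(
    {"patient_id": " 1001 ", "service_date": "03/07/2024"}
)
```

The point of returning `issues` alongside the cleaned record is expectation-setting: the known limitations of the data travel with the data itself.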
It’s always late.
Unlike some of the other topics, this one falls squarely in our laps. As the processors of data – or more specifically, the architects of the processes – it is up to us to ensure that users have the data they need, when they need it. As data sets grow over time, the time required to process the data (from ETL to cleansing, and cube processing to report generation) will continue to increase without intervention. It is not enough to simply throw hardware at the problem – we have to be active participants in making those load processes as efficient as possible. Even if the data is correct, it will suffer from some level of distrust if it is not provided in a timely manner.
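One common way to keep load times from growing in lockstep with the data (a sketch, not a prescription; the `modified_at` column name is an assumption) is an incremental load driven by a persisted high-water mark, so each run processes only what changed:

```python
def incremental_extract(rows, last_watermark):
    """Return only rows modified since the last successful load, plus
    the new watermark to persist for the next run.

    `rows` is any iterable of dicts carrying a `modified_at` timestamp
    (an assumed column). In a real pipeline this filter would be pushed
    down into the source query, e.g. WHERE modified_at > :watermark.
    """
    changed = [r for r in rows if r["modified_at"] > last_watermark]
    new_watermark = max(
        (r["modified_at"] for r in changed), default=last_watermark
    )
    return changed, new_watermark

rows = [
    {"id": 1, "modified_at": "2024-01-01T00:00:00"},
    {"id": 2, "modified_at": "2024-01-03T09:15:00"},
]
changed, watermark = incremental_extract(rows, "2024-01-02T00:00:00")
```

ISO-8601 timestamps compare correctly as strings, which keeps the sketch simple; a production version would use proper timestamp types and store the watermark transactionally with the load.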
It’s not clear where these numbers came from.
Data lineage is critical. In most modern analytical systems, data consumers are rarely looking directly at the original transactional data. Instead, they are looking at a copy (or a copy of a copy of a copy…) that has been massaged to fit into the analytical data model. Along with that transformation comes a need to trace back to where that reshaped data originally came from, for auditing and validation purposes. The absence of data lineage is one of the chief deficiencies I find in data warehouse systems. It takes effort to get this right, and it’s especially hard to “bolt it on” after the system is already live. Data lineage is one of those things that is easy to set aside until later, but this technical debt has a high interest rate.
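At its simplest, building lineage in from day one can mean stamping every row with its origin as it enters the warehouse. A minimal sketch, with illustrative column names (`src_system`, `src_batch_id`, `loaded_at`) that are assumptions rather than a standard:

```python
from datetime import datetime, timezone

def stamp_lineage(rows, source_system, batch_id):
    """Attach lineage metadata to each row during the load step.

    The column names here are illustrative; the point is that every
    row in the target can be traced back to the system and load batch
    it came from, for auditing and validation.
    """
    loaded_at = datetime.now(timezone.utc).isoformat()
    return [
        {
            **row,
            "src_system": source_system,
            "src_batch_id": batch_id,
            "loaded_at": loaded_at,
        }
        for row in rows
    ]

stamped = stamp_lineage(
    [{"order_id": 42, "amount": 19.99}], "vendor_feed_a", "batch-0001"
)
```

Three extra columns at load time are far cheaper than trying to reconstruct provenance after the system is live.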
The numbers don’t match.
Data inconsistency often appears in organizations that allow self-service reporting. Don’t read this as my saying that self-service data is bad – it’s not. Allowing subject matter experts to directly access data (rather than simply handing off structured reports) will continue to evolve as a means of discovering new patterns in data. That being said, when a company makes the strategic decision to expose analytical structures directly to users, the risk of having inconsistent results is multiplied. When anyone with proper access can connect to reporting tables and create their own reports, it’s entirely possible that two reports may give two different answers to the same question. To overcome this, proper documentation and training are critical. Those with access to underlying tables, views, and cubes must understand the meaning, granularity, and limitations of those structures.
The goalposts are constantly moving.
Although this is not entirely the fault of the technical side of the house, the problem can be magnified if there are no controls over what can be changed. Let’s say you’ve got a structured report that shows P&L by department. One of your department heads complains that her department’s data is being unfairly skewed because of the format of the report. Too often, if that department head makes enough noise, the report will be updated just to satisfy that request. The problem is that now the resulting report indicates something different, not just for that department but for all departments. This is not really a technical problem, but more of a political one. It takes a steady demeanor to know when to push back against unreasonable requests for change.
Nobody owns it.
Some organizations treat data as if it were a fake plastic tree in a dentist’s office – just stick it in the corner and it’ll be good for years. It’s not like that at all. Data, and the processes that support it, is more like a fickle house plant. It requires constant attention: proper sunlight, daily watering, and occasional pruning. If nobody is paying attention to the data or the plumbing that drives it, it’s going to be as useful as an unwatered fern. Each set of data must have a clear owner, both on the technical side as well as in the business unit.
It is what it is.
One of the overused phrases I’m trying to banish from my vocabulary is, “It is what it is.” However, that phrase seems applicable here. Often, when dealing with data from outside vendors, closed software systems, or other sources over which we have limited control, there are constraints on the data that can’t easily be overcome. If you want daily sales information but your vendor refuses to provide anything more granular than a weekly summary, you’ll have to find a way to deal with what you have.
When those limitations arise, be clear – both in your communications and your documentation – about the shortcomings of that set of data. And be clear about the boundaries, too. When communicating with business SMEs or executives about the deficiencies of one particular set of data, emphasize that the limitation doesn’t necessarily affect the remainder of the information available to them. Set expectations early and often to avoid distrust issues later.
It’s just wrong.
This one is the big one, and I purposefully saved it for last. Sometimes, the data you’ll receive is simply wrong (see “Garbage in, garbage out” above), in which case you’ll want to be sure to fully document and explain this limitation.
All too often, though, the mechanisms that process the data can muck up the data, turning good data into suspect data. The possible causes for this are numerous: incorrect source-to-target mapping, an unhandled exception in the data, incorrect or inconsistent business rules, or simply losing data during ETL processing (yes, it can happen). This is the most critical piece to get right, because those who depend on the data rarely have insight into the internal plumbing that magically transforms flat files into analytical dashboards. It can be very easy for data consumers to mistrust this process – and by extension, the data that comes out of it – simply because it’s a black box from their perspective.
When the data is deemed to be wrong due to ETL or other processing, it’s essential to get out in front of the problem. Communicate what you found and how it was fixed (you’ll have to tailor this message to match the technical aptitude of the audience), and demonstrate that the resulting data truly is corrected after the process change. Follow up to ensure that the issue does not recur, and communicate that you are doing so.
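One way to make that follow-up routine rather than heroic is an automated reconciliation check after every load. A sketch, assuming a simple key column (`id` here is illustrative), that catches rows silently dropped or duplicated during ETL:

```python
def reconcile(source_rows, target_rows, key="id"):
    """Compare source and target by row count and key set, returning a
    small report that can be published after each load.

    Illustrative only: real pipelines might also compare checksums or
    column-level aggregates, but even this much catches silent row loss.
    """
    src_keys = {r[key] for r in source_rows}
    tgt_keys = {r[key] for r in target_rows}
    return {
        "source_count": len(source_rows),
        "target_count": len(target_rows),
        "missing_in_target": sorted(src_keys - tgt_keys),
        "unexpected_in_target": sorted(tgt_keys - src_keys),
    }

report = reconcile(
    [{"id": 1}, {"id": 2}, {"id": 3}],
    [{"id": 1}, {"id": 3}],
)
```

Publishing a report like this alongside the data turns the black box into something consumers can inspect, which is the whole battle.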
The Fickleness of Trust
As noted at the beginning of this post, trust is hard to build – and even harder to regain after it has been lost. As the curators and protectors of data, those of us tasked with delivering tactical and analytical data must preserve – and occasionally rebuild – trust in the data we provide. A lack of trust is a tripwire for any organization, and we data professionals must do everything we can to maintain data fidelity for our data consumers.