It’s an odd query, yes, but in preparation to write this post I actually typed the above phrase into my browser. No, I’m certainly not looking to burn down my house. In fact, wait here while I clear my search history, just in case.
For the sake of argument, let’s say you’re planning to fry a turkey over the upcoming Thanksgiving holiday. Think about the research you’d do: What type of equipment will I need? How much oil should I buy? How big should the turkey be? How long should I cook it? All valid queries that should be answered before taking on the task of dropping a frozen bird into boiling oil. But are those the only questions you should ask? Talk to anyone about the dangers of frying a turkey, even those who have never done it, and they’ll tell stories about a brother-in-law, or a coworker, or some guy on YouTube who set ablaze the family homestead in a misguided effort to cook Thanksgiving dinner.
Statistically, it may seem like a silly question to ask. What are the odds that frying this turkey will set my house on fire? All in all, probably pretty low. But it does happen – and if it does, the consequences can be disastrous. So, when taking on this task – especially for the first time – asking questions (What factors make it more likely that this turkey fry will turn into a huge bonfire?) that can help reduce the risk seems like a good investment.
Be a data pessimist
If you’ve met me in person, you probably remember me as a glass-half-full guy. But when it comes to data management, I’m a full-on pessimist. Any data I get is crap until proven otherwise. Every data load process will fail at some point. And, given enough time and iterations, even a simple data movement operation can take down an entire organization. It’s the turkey burning down the house. Yes, the odds of a single data process wreaking havoc on the organization are very, very low, but the impact if realized is very, very high. High enough that it’s worth asking those questions. What part of this process could wreck our financial reporting? What factors make this more likely to happen? How can we mitigate those factors?
For the record, I don’t suggest that we all wear tin-foil hats and prepare for space aliens to corrupt our data. However, there are lots of unlikely-yet-realistic scenarios in almost any process. Think about your most rock-solid data operation right now. What potential edge cases could harm your data infrastructure? Sometimes it’s the things that might seem harmless:
- Is it possible that we could run two separate loads of the exact same data at the same time?
- What if a process extracts data from a file that is still being written to (by a separate process)?
- What if a well-meaning staff member loads a properly formatted but inaccurate data file to the source directory?
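The first two scenarios above, duplicate concurrent loads and reading a file that is still being written, can be guarded against cheaply. Here is a minimal Python sketch; the function names and the size-stability heuristic are my own illustrative choices, not a standard recipe, and the stability check is only an approximation (a paused writer can fool it):

```python
import os
import time

def file_is_stable(path, wait_seconds=2):
    """Heuristic: treat a file as safe to read only if its size stops
    changing between two checks, suggesting no writer is still active."""
    size_before = os.path.getsize(path)
    time.sleep(wait_seconds)
    return os.path.getsize(path) == size_before

def acquire_load_lock(lock_path):
    """Create a lock file atomically. O_CREAT | O_EXCL guarantees that if
    two loads of the same data set race, exactly one of them wins."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False  # another load already holds the lock

def release_load_lock(lock_path):
    """Remove the lock file once the load finishes (or fails)."""
    os.remove(lock_path)
```

A load process would acquire the lock before starting, verify the source file is stable before extracting, and release the lock in a finally block so a crashed load doesn't wedge the pipeline forever.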
Others, while even less likely, could lead to a real mess:
- Is it possible for my data load process to be purposefully executed with inaccurate or corrupt data?
- Could some situation exist within my ETL process that would allow essential rows of data to simply be lost, silently and without errors?
- Do I have any processes that could make unaudited changes to my financial records?
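The silent-row-loss scenario in particular is cheap to detect: require that every source row be accounted for, either loaded or explicitly rejected, and fail loudly on any shortfall. A hypothetical sketch (the function name and error wording are mine):

```python
def reconcile_row_counts(source_count, loaded_count, rejected_count=0):
    """Every source row must end up either loaded or explicitly rejected.
    Any unaccounted-for rows mean data was lost silently, so raise."""
    missing = source_count - (loaded_count + rejected_count)
    if missing != 0:
        raise ValueError(
            f"Reconciliation failed: {missing} row(s) unaccounted for "
            f"(source={source_count}, loaded={loaded_count}, "
            f"rejected={rejected_count})"
        )
    return True
```

Run as the last step of every load, this turns the worst kind of failure, the invisible one, into an ordinary error that pages someone.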
Each potential scenario has to be evaluated on its own terms: weigh the cost of preventing the issue against the likelihood that it occurs and the impact if it does.
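That evaluation can be made concrete with back-of-the-envelope expected-loss arithmetic. This is a deliberately crude sketch, not a formal risk model, and the function name and example figures are purely illustrative:

```python
def mitigation_is_worthwhile(annual_likelihood, impact_cost, prevention_cost):
    """Crude rule of thumb: mitigate when the expected annual loss
    (likelihood x impact) exceeds the cost of preventing the scenario."""
    expected_loss = annual_likelihood * impact_cost
    return expected_loss > prevention_cost
```

For example, a 1% annual chance of a $500,000 reporting disaster carries a $5,000 expected loss, which easily justifies a $2,000 safeguard, while the same safeguard against a 0.1% risk would not clear the bar on this arithmetic alone (though a high-enough impact may justify it anyway).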
Fortunately, most of the data problems we deal with are not as catastrophic as igniting one’s home with a fried turkey on Thanksgiving. However, as data professionals, our first responsibility is to protect the data. We must always pay attention to data risk to ensure that we don’t allow data issues to take the house down.
Threat modeling takes on a whole new meaning in the data space. All data is bad until proven good.
Kevin, definitely so. Trust but verify (and “trust” is optional).
Awesome post. Pessimistic optimism.
Thanks Kaia. I like “pessimistic optimism”. Perhaps my new trademark? 🙂