I’ve been building ETL processes for many years, and I’ve learned two universal truths: data preparation is messy, and you must always secure your data prep area. A data prep area is very similar to a commercial kitchen, and in the same way that customers aren’t allowed in the kitchen, so should data consumers be kept from accessing in-process data structures.
What Is a Data Prep Area?
First things first, let’s define what a data prep area is. Data preparation (prep) is a common phase of extract, transform, and load (ETL) operations in which data is temporarily written for cleansing, deduplication, reshaping, or other data modifications. Also sometimes referred to as a landing area or a staging area, this is a common design pattern when moving data from a data store optimized for online transaction processing (OLTP) to a data model more friendly to analytics or reporting.
The data prep area really is a lot like a restaurant kitchen: it’s sometimes chaotic, it’s not consumer friendly, and there is a legitimate risk of consuming half-prepared goods.
When Customers Wander Into the Kitchen
In spite of its intended purpose, it is not uncommon for data consumers to have access to a data prep area. And I get it – it’s occasionally handy to let users tinker with raw data, particularly those with wizard-level Excel skills. However, there are tangible risks when opening up the data prep area to end users:
- The data is in an interim state. By definition, the data prep or staging area is intended as temporary storage for in-progress data operations. Allowing data consumers to access this data is akin to serving them a partially-cooked meal. Medium-rare chicken, anyone?
- No single source of truth. The worst possible outcome of any data movement or transformation process is that there are multiple possible data sources providing answers to the same questions. If the data prep area is a temporary stop along the way to the data warehouse or ODS, allowing access to this way station practically guarantees that the data it contains will compete with fully processed and validated data in the DW or ODS.
- User access can interfere with data prep. If users are running expensive queries while ETL processes are running, the performance of both processes can suffer. In a data prep area, ETL processes should have priority.
- Possible security risks. Although it shouldn’t be this way, the data prep area sometimes does not get the same level of security scrutiny as does the data warehouse or other user-facing structures. If users are querying this interim data store directly, they may inadvertently gain access that they normally shouldn’t have.
Secure Your Data Prep Area!
Although it takes some institutional discipline, I highly recommend restricting user access to your data prep area. Doing so helps to eliminate some of the risks of getting bad or conflicting output, and keeps this data landing zone true to its intended purpose.