Extract, Transform, and Load (ETL) processes are the centerpieces in every organization’s data management strategy. Each step the in the ETL process – getting data from various sources, reshaping it, applying business rules, loading to the appropriate destinations, and validating the results – is an essential cog in the machinery of keeping the right data flowing. Establishing a set of ETL best practices will make these processes more robust and consistent.
Over the course of 10+ years I’ve spent moving and transforming data, I’ve found a score of general ETL best practices that fit well for most every load scenario. Following these best practices will result in load processes with the following characteristics:
Most of the examples I flesh out are shown using SQL Server Integration Services. However, the design patterns below are applicable to processes run on any architecture using most any ETL tool. So whether you’re using SSIS, Informatica, Talend, good old-fashioned T-SQL, or some other tool, these patterns of ETL best practices will still apply.
I’m careful not to designate these best practices as hard-and-fast rules. Even for concepts that seem fundamental to the process (such as logging), there will certainly be edge cases that negate the need for one or more of these. However, for most ETL processes, the best practices detailed below should be considered central to the architecture.
Below I’ve listed some of the essentials that are key to most any ETL implementation. In the coming weeks and months, I’ll be blogging about each of these in detail.
ETL Best Practices
What is ETL? For those new to ETL, this brief post is the first stop on the journey to best practices.
Logging: A proper logging strategy is key to the success of any ETL architecture. In this post, I share some of the essential concepts around logging ETL operations.
Auditing. A load without errors is not necessarily a successful load. A well-designed process will not only check for errors but also support auditing of row counts, financial amounts, and other metrics.
Data Lineage. Understanding where data originated from, when it was loaded, and how it was transformed is essential for the integrity of the downstream data and the process that moves it there.
ETL Modularity. Creating reusable code structures is important in most development realms, and even more so in ETL processes. ETL modularization helps avoid writing the same difficult code over and over, and reduces the total effort required to maintain the ETL architecture.
ETL Atomicity. How big should each ETL process be? In this post, I discuss the merits of properly sizing your ETL logic.
Error Handling. What happens when things go wrong? This post reviews design patterns around prevention and management of errors in ETL processes.
Managing Bad Data. When suspect data is discovered, there needs to be a system for cleansing or otherwise managing nonconforming rows of data. In this post, I share some of the design patterns for handling bad data.
Get Your Email Out Of My ETL. There is a proper tool for every job. Embedding email notifications directly in ETL processes adds unnecessary complexity and potential failure points.
Using ETL Staging Tables. Often, the use of interim staging tables can improve the performance and reduce the complexity of ETL processes.
Secure Your Data Prep Area. A staging or landing area for data currently being processed should not be accessible by data consumers. Otherwise, you could end up with bad data, conflicting analytics, or potential security risks.