In my ongoing series on ETL Best Practices, I am illustrating a collection of extract-transform-load design patterns that have proven to be highly effective. In the interest of comprehensive coverage on the topic, I am adding to the list an introductory prequel to address the fundamental question: What is ETL?
What Is ETL?
ETL is shorthand for the extraction, transformation, and loading process used in most data movement operations. ETL processes typically include programmatically retrieving data from one or more sources, performing updates on this data (if needed), and finally the pushing of the data to a destination structure.
ETL processes are found in most every corner of data storage and processing. Small, informal ETL loads can be used to create ad-hoc spreadsheets and aggregated data for simple analytics. More complex enterprise ETL processes are used to feed large data warehouses and similar structures. Most every data professional has performed ETL at one time, even if they didn’t refer to it as such.
When working with ETL operations, you will commonly hear references to the data pipeline (or just “the pipeline”). It’s easy to visualize the ETL process as a pipeline, where raw data comes in and usable information goes out. Along the way, some of the data may be piped elsewhere for cleansing, and other bits might be flagged as bad and sent to a holding area. The data pipeline can be simple and straight, or may have numerous twists, turns, and branches. In most cases this virtual data pipeline exists in RAM on the machine running the ETL operation, but can also be temporarily stored (often in staging or temp tables) as needed.
As the name implies, there are three distinct phases in an ETL operation.
During this phase, data is retrieved from the source and loaded into the pipeline. The source might be a relational database, a file, web service, an IOT device – most anything that can produce or store data can be used as a source in an ETL process. This phase isn’t necessarily limited to one source of data; it is common for data movement operations to use multiple sources.
Although extraction is usually the simplest part of the ETL process, it can grow to be complicated. For example, when processing very large sets of data, the extraction may occur incrementally, requiring selection logic to retrieve a specific subset of data for each load.
The transformation phase is where the data manipulation and branching takes place. Among the tasks that can take place during transformation:
- Data is reshaped to fit the destination
- Data type changes are performed (such as converting text to integer)
- Duplicates are removed from the data
- The data is joined to lookup tables to validate against a trusted list of values
- Extra spaces are trimmed
- Case and capitalization are standardized
- Bad rows of data are flagged and sent down a separate path in the data pipeline
- The granularity of the data is changed, where one input row becomes several output rows or vice versa
- Application of business logic
Not every ETL process needs all of the manipulation listed above; most processes will use a subset of these. The extent to which data is transformed depends on the quality and shape of the source data and how well it meets the needs of business. Some data sets will require minimal transformation, while others need significant rework.
The transformation portion of ETL is where most of the work is done, and is the most challenging and complex part of the process. In fact, for most data movement initiatives, creating the data transformations – including requirements gathering, development, and testing – will be the most time consuming part of the entire project. If there is one bit of project management advice I could give around ETL, it would be this: don’t underestimate the time or effort needed to build data transformations. I’ve seen projects delayed significantly because it was assumed that data transformation needs were minimal.
The loading phase of ETL is the final leg of the process, during which data is written to where it will be permanently stored.
By the time the data reaches the loading phase, all of the updates have been performed and the data is as good as it’s going to get. However, that doesn’t mean that the loading phase is just a simple copy operation. In some ETL processes, the load phase is an upsert operation, which will conditionally insert or update based on whether the rows being loaded already exist in some form in the target. In other cases, there are multiple destinations; for example, if you send bad data to a triage table, the load process will have two endpoints.
Putting It All Together
The three phases of each ETL operation – extract, transform, and load – are usually tightly coupled together. As I noted earlier, the data pipeline typically exists virtually in RAM on the machine performing the ETL, so each load from source to destination typically runs in one contiguous operation. However, ETL processes are typically grouped together into logical units, and executed either in sequence in parallel.
Consider an example of a retail store loading a data warehouse each night with its sales data. There would be an ETL process to load the table of sales transactions, but this table would depend on supporting tables – including customers, sales associates, products, among others – that describe the entities contributing to that transaction. Each of those tables mentioned would have its own ETL process to load the data warehouse. Although the individual loads could be executed on their own, there is little business value in loading just one table. Therefore, each of these ETL processes become smaller parts of a larger orchestrated load, which triggers each load in sequence.
What is ETL? In this post I have briefly addressed that question, describing the overall process while diving into each of the phases of an ETL operation. In a future post, I will also address a similar pattern called ELT (extract-load-transform) and compare these two methodologies.