In my ongoing series on ETL Best Practices, I am illustrating a collection of extract-transform-load design patterns that have proven to be highly effective. In the interest of comprehensive coverage of the topic, I am adding to the list an introductory prequel to address the fundamental question: What is ETL?
What Is ETL?
ETL is shorthand for the extraction, transformation, and loading process used in most data movement operations. These processes typically include programmatically retrieving data from one or more sources, performing updates on this data (if needed), and finally pushing the data to a destination structure.
ETL processes are found in most every corner of data storage and processing. Small, informal loads can be used to create ad-hoc spreadsheets and aggregated data for simple analytics. More complex enterprise ETL processes are used to feed large data warehouses and similar structures. Most every data professional has performed ETL at one time, even if they didn’t refer to it as such.
When working with ETL operations, you will commonly hear references to the data pipeline (or just “the pipeline”). It’s easy to visualize such a process as a pipeline, where raw data comes in and usable information goes out. Along the way, some of the data may be piped elsewhere for cleansing, and other bits might be flagged as bad and sent to a holding area. The data pipeline can be simple and straight, or may have numerous twists, turns, and branches. In most cases this virtual data pipeline exists in RAM on the machine running the ETL operation, but data can also be temporarily persisted (often in staging or temp tables) as needed.
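The branching described above can be sketched in just a few lines. This is a minimal, illustrative Python example, not any particular tool's API; the validation rule, the field names, and the helper functions (`is_valid`, `clean_row`) are all hypothetical.

```python
def is_valid(row):
    # Hypothetical rule: a row is "bad" if it lacks a customer id.
    return row.get("customer_id") is not None

def clean_row(row):
    # A simple cleansing step: trim whitespace from all text fields.
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}

def run_pipeline(raw_rows):
    # Raw data flows in; cleansed rows flow out, and rows flagged as
    # bad are diverted to a holding area for later triage.
    good, holding_area = [], []
    for row in raw_rows:
        if is_valid(row):
            good.append(clean_row(row))
        else:
            holding_area.append(row)
    return good, holding_area
```

Everything here lives in memory, mirroring the in-RAM pipeline described above; a persisted variant would write `holding_area` to a staging table instead.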
As the name implies, there are three distinct phases in an ETL operation.
Extraction

During this phase, data is retrieved from the source and loaded into the pipeline. The source might be a relational database, a file, a web service, an IoT device – most anything that can produce or store data can be used as a source in an ETL process. This phase isn’t necessarily limited to one source of data; it is common for data movement operations to use multiple sources.
Although extraction is usually the simplest part of the ETL process, it can grow to be complicated. For example, when processing very large sets of data, the extraction may occur incrementally, requiring selection logic to retrieve a specific subset of data for each load.
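One common way to implement that selection logic is a high-water mark: each run pulls only the rows modified since the last successful load. The sketch below is illustrative, with a plain list standing in for what would normally be a database query, and `modified` as an assumed change-tracking column.

```python
def extract_incremental(source_rows, last_watermark):
    """Return rows changed after last_watermark, plus the new watermark.

    source_rows: iterable of dicts, each with a 'modified' timestamp.
    last_watermark: the highest 'modified' value seen by the prior load.
    """
    # Select only the subset of data that is new since the last run.
    batch = [r for r in source_rows if r["modified"] > last_watermark]
    # Advance the watermark; if nothing changed, keep the old value.
    new_watermark = max((r["modified"] for r in batch), default=last_watermark)
    return batch, new_watermark
```

The returned watermark would be stored durably between runs so the next execution knows where the previous one left off.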
Transformation

The transformation phase is where the data manipulation and branching take place. Among the tasks that can take place during transformation:
- Data reshaping
- Change of data granularity, where one input row becomes several output rows or vice versa
- Data type changes (such as converting text to integer)
- Elimination of duplicates
- Validation against a trusted list of values using lookup tables
- Removal of trailing whitespace or other unprintable characters
- Standardization of case and capitalization
- Identification and redirection of bad or suspect rows of data
- Application of business logic
Not every ETL process needs all of the manipulation listed above; most processes will use a subset of these. The extent to which data is transformed depends on the quality and shape of the source data and how well it meets the needs of the business. Some data sets will require minimal transformation, while others need significant rework.
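A few of the transformations from the list above can be shown in a compact sketch. This is a hedged illustration, not a production routine; the field names (`name`, `amount`) and the duplicate key are assumptions made for the example.

```python
def transform(rows):
    # Applies a subset of the transformations listed above:
    # whitespace removal, case standardization, a data type change
    # (text to integer), and elimination of duplicates.
    seen, out = set(), []
    for row in rows:
        name = row["name"].strip().title()   # trim whitespace, fix case
        amount = int(row["amount"])          # convert text to integer
        key = (name, amount)
        if key in seen:                      # skip duplicate rows
            continue
        seen.add(key)
        out.append({"name": name, "amount": amount})
    return out
```

A real process would likely layer business logic and lookup-table validation on top of this, and redirect rows that fail conversion rather than letting the error halt the load.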
The transformation portion of ETL is the most challenging and complex part of the process. In fact, for most data movement initiatives, creating the data transformations – including requirements gathering, development, and testing – is the most time-consuming part of the entire project. If there is one bit of project management advice I could give around ETL, it would be this: don’t underestimate the time or effort needed to build data transformations. I’ve seen projects delayed significantly because architects or developers grossly underestimated the time required for building and testing transformations.
Loading

The loading phase of ETL is the final leg of the process. This phase pushes the data to the structures where it will be permanently stored.
By the time the data reaches the loading phase of ETL, all of the updates are complete and the data is as good as it’s going to get. However, that doesn’t mean that the loading phase is just a simple copy operation. In some ETL processes, the load phase is an upsert operation, which will conditionally insert or update based on whether the rows being loaded already exist in some form in the target. In other cases, there are multiple destinations; for example, if you send bad data to a triage table, the load process will have two endpoints.
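The upsert-with-triage load described above can be sketched as follows. A dict keyed by primary key stands in for the target table here; in a real load the same logic would typically be a SQL `MERGE` or equivalent. The `bad` flag and field names are hypothetical.

```python
def load(target, triage, rows):
    # Conditionally insert or update each row in the target, and route
    # rows flagged as bad to a second endpoint (a triage store).
    for row in rows:
        if row.get("bad"):
            triage.append(row)                 # second destination
        elif row["id"] in target:
            target[row["id"]].update(row)      # row exists: update it
        else:
            target[row["id"]] = dict(row)      # new row: insert it
```

The key design point is that the load phase no longer changes data values; it only decides *where* each finished row lands.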
Putting It All Together
The three phases of each extract, transform, and load operation are usually tightly coupled. As I noted earlier, the data pipeline typically exists virtually in RAM on the machine performing the ETL, so each load from source to destination generally runs as one contiguous operation.
Consider an example of a retail store loading a data warehouse each night with its sales data. There would be an ETL process to load the table of sales transactions, but this table would depend on supporting tables – customers, sales associates, and products, among others – that describe the entities contributing to that transaction. Each of those tables would have its own ETL process to load the data warehouse. Although the individual loads could be executed on their own, there is little business value in loading just one table. Therefore, each of these ETL processes becomes a smaller part of a larger orchestrated load, which triggers each load in sequence.
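That orchestration can be sketched as a simple ordered runner: supporting tables load first, and the sales transactions that reference them load last. The table names and loader callables below are hypothetical; a real orchestrator (a scheduler or workflow tool) would add logging, retries, and failure handling.

```python
def run_nightly_load(loaders):
    """Run each table's ETL process in dependency order.

    loaders: ordered list of (table_name, callable) pairs,
    with supporting tables first and dependent tables last.
    """
    completed = []
    for name, loader in loaders:
        loader()                 # an exception here halts the sequence
        completed.append(name)
    return completed
```

Sequencing matters because loading the transactions table before its supporting tables could leave it referencing customers or products that don't exist yet in the warehouse.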
When building an extract, transform, and load pipeline, you’ll have the option to build your own or buy a tool specifically designed for this type of operation. There are several commercially available ETL tools to choose from.
Should you build your own ETL processes from scratch, or buy a tool? If the latter, which tool should you buy? As always, It Depends. You should consider data volume, technical staff knowledge, and existing vendor relationships when making these decisions.
What is ETL? In this post I have briefly addressed that question, describing the overall process while diving into each of the phases of an ETL operation. In a future post, I will also address a similar pattern called ELT (extract-load-transform) and compare these two methodologies.