If data is a train, then metadata is the track on which it travels. A good metadata definition in ETL processes will help to ensure that the flow of the data is predictable, robust, and properly constrained to avoid errors. However, many ETL processes take a hands-off approach when it comes to metadata. In some cases, this laissez-faire design allows for easy “quick and dirty” data analysis. However, most enterprise ETL processes will benefit from good metadata to help keep the train running on time and on track.
What is metadata?
First, let’s start with the basic definition: metadata is literally data about data. Metadata describes the structure and other properties of some data object. Among other things, metadata helps to define:
- The data set(s) that will be used. This can include the type of data set (files, tables, web services endpoint, etc.) as well as extended information about each (connection string, creation date, etc.).
- The individual fields within those data sets. This includes column names, data types, and data length.
- Data constraints. Uniqueness constraints, nullability settings, and field validation (such as requiring a numerical field to be within a certain range) are examples of these.
- Extended properties for binary data types. If, for example, you are handling image files, metadata can reveal detailed information such as location, camera type and settings. This category of metadata tends to be highly specific to the particular file type being moved.
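The categories above can be captured explicitly in code. Here is a minimal sketch in Python of what an explicit metadata definition might look like for a hypothetical customer extract; the class names, field names, and constraint values are illustrative, not from any particular tool:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ColumnMeta:
    """Metadata for a single field within a data set."""
    name: str
    data_type: type                   # e.g. int, str
    max_length: Optional[int] = None  # length constraint, if any
    nullable: bool = True             # nullability setting

@dataclass
class DataSetMeta:
    """Metadata for a data set, such as a CSV extract or a table."""
    name: str
    source_type: str                  # "csv file", "table", etc.
    columns: list = field(default_factory=list)

# A hypothetical customer extract with explicit names, types, and constraints.
customer_extract = DataSetMeta(
    name="customer_extract",
    source_type="csv file",
    columns=[
        ColumnMeta("customer_id", int, nullable=False),
        ColumnMeta("email", str, max_length=255, nullable=False),
        ColumnMeta("signup_date", str, max_length=10),
    ],
)
```

With a definition like this in place, every downstream step in the pipeline can check incoming data against the declared names, types, and constraints rather than accepting whatever arrives.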
ETL Antipattern: lazy metadata
Metadata management in ETL processes can be challenging. When exchanging data between systems and across different mediums (unstructured data, flat files, XML, or relational data), the definition of data types can get blurry. Some sources – including unstructured data and flat files – have little if any metadata configuration. On the other end of the spectrum is relational (RDBMS) data, which strictly defines data types and constraints.
Because of the challenges of this exchange process, many developers opt for an approach that I refer to as lazy metadata. In this design pattern, the ETL pipeline is configured very generically, using little if any metadata definition beyond that which is absolutely necessary for data movement. Some of the common artifacts of this lazy metadata approach include:
- Nondescript names for connection endpoints. These might simply be named according to the type of data structure (“CSV file”, “Oracle database”), or may be even more frustratingly vague (“Source1”, “Destination1”).
- Default field names for those without explicit definition (“Column1”, “Column2”)
- Wide text data types for every field
- No constraints or checks on the data in the ETL pipeline
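To make these artifacts concrete, here is a small Python sketch of a lazy-metadata ingestion of a headerless CSV extract (the sample data is invented for illustration). Every field lands as text under a default name, with no constraints at all:

```python
import csv
import io

# A headerless extract: the file itself carries no metadata.
raw = "42,Alice,2023-01-15\n43,Bob,2023-02-01\n"

# Lazy metadata: every field becomes wide text under a default name.
rows = []
for record in csv.reader(io.StringIO(raw)):
    rows.append({f"Column{i + 1}": value for i, value in enumerate(record)})

# Everything is a string, and nothing stops bad data from flowing through.
print(rows[0])  # {'Column1': '42', 'Column2': 'Alice', 'Column3': '2023-01-15'}
```

This is the path of least resistance, which is exactly why it shows up so often in production pipelines.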
There are degrees of lazy metadata. In some cases, good object and column names are used but data types are all set to wide nullable text fields. Still others have proper data types defined but neglect the constraints necessary to check for uniqueness or foreign key values until the data is loaded to its final destination. In the worst cases, the ETL pipeline processes the data exactly as it arrives from the source and carries that same vague metadata through to the destination endpoint, resulting in unhelpfully generic object names forever persisted in the downstream databases.
Benefits of good metadata
Building useful metadata around your data loads does take some time and effort. This can be particularly tedious with endpoints that don’t have their own intrinsic metadata, such as headerless CSV files or unstructured text data. However, in most cases it is worth the effort required to create good object names, data types, and constraints. Having good metadata definition can accomplish several goals:
Identify and handle data problems early in the pipeline
By using proper data types and constraint checking, you can detect errors earlier in the process and then handle them (through data correction, isolation, or triage). This makes the resulting ETL process more robust and predictable, and makes it easier to adapt over time to new data anomalies.
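An early constraint check can be as simple as validating each raw row against the declared rules and routing failures to a quarantine area for triage. A minimal Python sketch, with invented field names and rules:

```python
def validate_row(row):
    """Check one raw row against declared rules before loading.
    (The field names and rules here are illustrative only.)"""
    errors = []
    if not row.get("customer_id", "").isdigit():
        errors.append("customer_id must be a whole number")
    if not row.get("email"):
        errors.append("email is required")
    return errors

incoming = [
    {"customer_id": "42", "email": "a@example.com"},
    {"customer_id": "oops", "email": ""},
]

good, quarantined = [], []
for row in incoming:
    problems = validate_row(row)
    (quarantined if problems else good).append(row)

# The bad row is caught and isolated before it ever reaches the destination.
print(len(good), len(quarantined))  # 1 1
```

Because the check runs early, the bad row can be corrected or triaged without disturbing the rest of the load.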
Prevent inconsistencies in the destination systems
Related to the point above: if you rely solely on the constraints of the final destination table(s), the ETL process will eventually fail mid-load on one of those constraints, leaving the destination tables partially loaded. Even worse, if the table to which you’re loading data has no constraints (which is in itself a deficient design, but that’s a conversation for another day), you could find that the data loads are succeeding but are loading bad data.
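The partial-load failure mode is easy to demonstrate. The sketch below mimics a uniqueness constraint that is enforced only at the destination: the load fails partway through, and the first row is already committed by the time the second is rejected (the data and constraint are invented for illustration):

```python
# Two rows that collide on the key; the duplicate is only caught at load time.
rows = [{"order_id": 1, "total": 10}, {"order_id": 1, "total": 12}]

destination, seen = [], set()
try:
    for row in rows:
        # Mimics a uniqueness constraint enforced only at the destination.
        if row["order_id"] in seen:
            raise ValueError("uniqueness constraint violated")
        seen.add(row["order_id"])
        destination.append(row)
except ValueError:
    pass  # the load fails partway through

print(len(destination))  # 1 -- a partially loaded table
```

Checking uniqueness in the pipeline, before any rows are written, would have flagged the duplicate without leaving the destination in this half-loaded state.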
The resulting data is easier to query
For example, if you are loading numeric data or dates as text, you’ll have to convert that data to its proper data type before doing any real queries against it.
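A small Python illustration of the cost of text-typed numbers (the sample data is invented): every aggregate needs a conversion first, and comparisons on the raw strings give the wrong answer entirely.

```python
from decimal import Decimal

# Amounts stored as text rather than a numeric type.
sales = [{"amount": "19.99"}, {"amount": "5.00"}]

# Every query must convert first -- the equivalent of a CAST() on each read.
total = sum(Decimal(row["amount"]) for row in sales)
print(total)  # 24.99

# Worse, text comparison sorts numbers-as-strings incorrectly:
print(sorted(["9", "10", "100"]))  # ['10', '100', '9']
```

Storing the value with its proper type in the first place pushes that conversion cost to zero for every downstream consumer.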
It makes your ETL code easier to read
With proper names on data endpoints, tables, and fields, the ETL code becomes self-documenting and is much easier to understand. Your fellow developers – and perhaps Future You – will thank you.
Troubleshooting and debugging are much easier
Have you ever had to investigate the execution logs for an ETL process with object names such as “Database connection 1”, “Table”, or “Load Data”? That alone is enough justification for proper metadata naming.
Mind your metadata
There are some cases where rigid metadata constraints don’t add a lot of value, such as when loading unstructured data to volatile staging tables for further processing in subsequent steps in the ETL. Whether it’s in the initial staging load or a subsequent ETL task, using good metadata such as sensible names, data types, and constraints makes data load processes more robust and easier to understand.