A compelling feature of the new Data Quality Services in SQL Server 2012 is the ability to apply rules to fields (domains) to describe what makes up a valid value. In this brief post, I’d like to walk through the concepts of domain validation and demonstrate how this can be implemented in DQS.
Domain validation essentials
Let’s ponder domain validation by way of a concrete example. Consider the concept of age: it’s typically expressed in discrete, non-negative whole numbers. However, the expected values of the ages of things will vary greatly depending on the context. An age of 10 years seems reasonable for a building, but sounds ridiculous when describing fossilized remains. A date of “1/1/1950” is a valid date and would be appropriate for classifying a person’s date of birth, but would be out of context if describing when a server was last restarted. In a nutshell, the purpose of domain validation is to allow context-specific rules to provide reasonableness checks on the data.
A typical first step in data validation would involve answering the following questions:
- Is the data of the right type? This helps us to eliminate values such as the number “purple” and the date “3.14159”.
- Does the data have the right precision? This is similar to the point above: If I’m expecting to store the cost of goods at a retail store, I’m probably not going to configure the downstream elements to store a value of $100 million for a single item.
- Is the data present where required? When expressing address data, the first line of an address might be required while a second line could be optional.
Domain validation goes one step further by answering the question, “Is a given value valid when used in this context?” It takes otherwise valid data and validates it to be sure it fits the scenario in play.
Domain validation in DQS
Even if you don’t use this term to describe it, you’re probably already doing some sort of domain validation as part of your ETL or data maintenance routines. Every well-designed ETL system has some measure sanity check to make sure data fits semantically as well as technically.
The downside to many of these domain validation scenarios is that they can be inconsistent and are usually decentralized. Perhaps they are implemented at the outer layer of the ETL before data is passed downstream. Maybe the rules are applied as stored procedures after they are loaded, or even as (yikes!) triggers on the destination tables.
Data Quality Services seeks to remedy the inconsistency and decentralization issue, as well as make the process easier, by way of domain validation rules. When creating a domain in DQS, you are presented with the option of creating domain rules that govern what constitutes a valid value for that domain. For the example below, I’m using data for automobile makes and models, and am implementing a domain rule to constrain the value for the number of doors for a given model.
With the rule created, I can apply one or more conditions to each of the rules. As shown, I am going to constrain the valid values to lie between 1 and 9 inclusive, which should account for the smallest and largest automobile types (such as limousines and buses).
For this rule, I’m setting the conditions that the value must be greater than zero or less than ten. Note that there is no requirement to use this bookend qualification process; you can specify a single qualifier (for example, greater than zero) or have multiple conditions strung together in the same rule. You can even change the AND qualifier to an OR if the rule should be met if either condition is true – though I would caution you when mixing 3 or more conditions using both AND and OR, as the behavior may not yield what you might expect.
That’s all there is to creating a simple domain validation rule. Remember that for the condition qualifiers, you can set greater than, less than, greater than/equal to, etc., for the inclusion rule when dealing with numerical or date domain data types. For string data types, the number of options is even greater, as shown below:
Of particular interest here is that you can leverage regular expressions and patterns to look for partial or pattern matches within the string field. You can also check the string value to see if it can be converted to numeric or date/time.
The rule in action
With the new domain validation rule in place, let’s run some test data through it. I’m going to create a few test records, some of which violate the rule we just created, and run them through a new DQS project using the knowledge base we modified with this rule.
I’ll start off with the dirty data as shown below. You can probably infer that we’ve got a few rows that do not comply with the rule we created, on both ends of the value scale:
After creating a new data cleansing project, I use the data shown above to test the rule constraining the number of doors. As shown below in the New output tab, we have several rows that comply with this new rule:
In addition, there are two distinct values found that do not meet the criteria specified in the new rule. Selecting the Invalid tab, I see the values 0 and 12 have failed validation, as they fall outside the range specified by the rule. In the Reason column, you can see that we get feedback indicating that our new rule is the reason that these records are marked as Invalid:
So by implementing this rule against my data, I am able to validate not only that the value is present and of the correct type, but that it is reasonable for this scenario.
In this post we’ve reviewed the essentials of domain validation and how we can implement these checks through domain rules in SQL Server Data Quality Services. In my next post, I’ll continue the discussion around domain rules by reviewing how these rules can be applied to composite domains in DQS.