As I’ve been working with Data Quality Services over the past few weeks, I’ve spent a lot of time working with data domains, DQS composite domains, and rules. In that exploration, I’ve found some behavior that might not be expected when performing cleansing operations against a knowledge base containing a composite domain.
In this post, I’ll outline the expected data cleansing behavior for composite domain value combinations, and will show how the actual result is not what one would expect in this case. I’ll also briefly describe a couple of workarounds to address this issue.
DQS Composite Domains
Here’s the layout of the issue at hand. Composite domains can be created in knowledge bases in DQS, and encompass two or more existing organic domains within the same knowledge base. Those composite domains can then be leveraged in a data cleansing project; if you engage all of the included domains that are part of a composite, that composite domain will automatically be included as part of the cleansing operation. Now from here a reasonable person (and by “a reasonable person,” I mean me) could assume that if the composite domain is used as part of the cleansing operation, that it would perform the cleansing operation across the product of the composite domain rather than just the individual domains therein. However, my experimentation has found otherwise.
Make sense? Don’t worry – if I lost you in the problem description, I think a simple example should bring it back into focus.
I’ve been using automotive data for a lot of my recent DQS samples, so we’ll stick with that for now. I’ve got a reference set of data with (mostly) valid automobile data that I’m using to build a DQS knowledge base through the knowledge discovery activity. Included in the reference data are automobiles of various make and model, among them the Chevrolet Camaro and several flavors of Ford automobile (we’ll get back to these specifics in a second). When I import this information through knowledge discovery, it renders both Ford and Chevrolet as valid automobile makes, and the Camaro is present as a valid model of automobile.
Now, I want to create an association between make and model, since model is mostly dependent on make. I create a new composite domain in my knowledge base, and use the combination of Make and Model domains to build this new composite domain.
With that done, I’ll republish the knowledge base, and we’re good to go. Next, I’ll create a DQS cleansing project that will leverage the knowledge base we’ve built with this automobile data. I’m going to use a smaller and dirtier set of data to run through the cleansing process. This data will also bring to light a counterexample of the expected behavior of the composite domain.
When I wire up the table containing the dirty data to the new cleansing project, I get the option of including the DQS composite domain since I’m leveraging both of the elements of that composite domain against the data to be cleansed. By clicking the View/Select Composite Domain button I can see that the Make and Model composite domain is used by default.
Before I run the cleansing operation on this DQS project, let’s peek at the data we’ll be cleansing in this new project:
You’ll see that I called out a particular entry, and it’s probably clear why I referenced the Camaro earlier. In our dirty data we have a Ford (valid make) Camaro (valid model), but there’s no such thing as a Ford Camaro in production or in our knowledge base. When the make and model domains are individually verified, this record would be expected to go through the cleansing process with no changes at all. However, because we’ve got a DQS composite domain set up to group together the make and model, I expect this to fall out as a new entry (rather than a match to something existing in the knowledge base) since our KB does not have the Make and Model combination of Ford Camaro.
However, when I run the cleansing operation and review the results, what I find is not what I expected:
Under the Make and Model composite domain results (notice the individual Make and Model domains are not present, since we’ve engaged the composite domain), I find that the incorrect Ford Camaro entry is shown, but instead of showing up under the New tab, it instead surfaces in the Correct tab indicating that the value is already present in the knowledge base. Given that the displayed reason indicates a “Domain value” match, this seems to indicate that, despite the use of the DQS composite domain, the individual domains are instead being used for aligning the cleansed data with the information in the knowledge base.
Ideally, what we’d see is the Ford Camaro entry pushed to the New tab since there is no such combination in the KB. However, there are a few limited options to work around this.
First, you could create a separate field containing the entire make and model combination in your source data, and perform the Make + Model validation against the single field. This is probably the most realistic workaround as it doesn’t require a lot of static rules. However, it still means that you will likely need to reengineer the way you stage the data. It’s a generally accepted practice to store data elements as atomic units, and building a Make + Model field limits your options or forces you to undo that same operation later in the ETL.
You also have the option to create rules against your composite domains to set if/then scenarios for data validation. For example, you could create a rule that dictates that if the car is a Camaro, the make must be Chevrolet. However, unless the cardinality of your data is very, very low, don’t do this. Creating static rules to deal with data semantics is like pulling at a loose thread on a sweater: you’ll never find the end of it, and it’ll just make a mess in the meantime.
In this post, I’ve highlighted the unexpected behavior of DQS composite domains in data cleansing operations, along with a few workarounds to help you get past this issue. As always, comments and alternative suggestions are welcome!