Posted by Tim Mitchell on 15 May 2012, 08:39

I’m working on a more comprehensive review of last week’s SQL Rally event, but I’d like to go ahead and share my slide deck and photos from the event.

For those who attended my Data Quality Services session on Thursday, thanks so much for coming.  I had 100 or so in attendance, and a lot of good questions and discussion on this topic.  You can download the slide deck here.

If you saw me at the event, you know that I didn’t go anywhere without my camera.  I have a few hundred pictures from the event that I’ve loaded onto my Flickr site.  You can view or download those pictures here.

As I’ve been working with Data Quality Services over the past few weeks, I’ve spent a lot of time working with data domains, composite domains, and rules.  In that exploration, I’ve found some behavior that might not be expected when performing cleansing operations against a knowledge base containing a composite domain.

In this post, I’ll outline the expected data cleansing behavior for composite domain value combinations, and will show how the actual result is not what one would expect in this case.  I’ll also briefly describe a couple of workarounds to address this issue.

Overview

Here’s the layout of the issue at hand.  Composite domains can be created in knowledge bases in DQS, and encompass two or more existing organic domains within the same knowledge base.  Those composite domains can then be leveraged in a data cleansing project; if you engage all of the included domains that are part of a composite, that composite domain will automatically be included as part of the cleansing operation.  Now from here a reasonable person (and by “a reasonable person,” I mean me) could assume that if the composite domain is used as part of the cleansing operation, the values would be cleansed as a combination across the member domains rather than just within each individual domain.  However, my experimentation has found otherwise.

Make sense? Don’t worry – if I lost you in the problem description, I think a simple example should bring it back into focus.

Example

I’ve been using automotive data for a lot of my recent DQS samples, so we’ll stick with that for now.  I’ve got a reference set of data with (mostly) valid automobile data that I’m using to build a DQS knowledge base through the knowledge discovery activity.  Included in the reference data are automobiles of various make and model, among them the Chevrolet Camaro and several flavors of Ford automobile (we’ll get back to these specifics in a second).  When I import this information through knowledge discovery, it renders both Ford and Chevrolet as valid automobile makes, and the Camaro is present as a valid model of automobile.

image

Now, I want to create an association between make and model, since model is mostly dependent on make.  I create a new composite domain in my knowledge base, and use the combination of Make and Model domains to build this new composite domain.

image

With that done, I’ll republish the knowledge base, and we’re good to go.  Next, I’ll create a DQS cleansing project that will leverage the knowledge base we’ve built with this automobile data.  I’m going to use a smaller and dirtier set of data to run through the cleansing process.  This data will also bring to light a counterexample of the expected behavior of the composite domain.

When I wire up the table containing the dirty data to the new cleansing project, I get the option of including the composite domain since I’m leveraging both of the elements of that composite domain against the data to be cleansed.  By clicking the View/Select Composite Domain button I can see that the Make and Model composite domain is used by default.

image

Before I run the cleansing operation on this DQS project, let’s peek at the data we’ll be cleansing in this new project:

image

You’ll see that I called out a particular entry, and it’s probably clear why I referenced the Camaro earlier.  In our dirty data we have a Ford (valid make) Camaro (valid model), but there’s no such thing as a Ford Camaro in production or in our knowledge base.  When the make and model domains are individually verified, this record would be expected to go through the cleansing process with no changes at all.  However, because we’ve got a composite domain set up to group together the make and model, I expect this to fall out as a new entry (rather than a match to something existing in the knowledge base) since our KB does not have the Make and Model combination of Ford Camaro.

However, when I run the cleansing operation and review the results, what I find is not what I expected:

image

Under the Make and Model composite domain results (notice the individual Make and Model domains are not present, since we’ve engaged the composite domain), I find that the incorrect Ford Camaro entry is shown, but rather than showing up under the New tab, it surfaces in the Correct tab, indicating that the value is already present in the knowledge base.  Given that the displayed reason indicates a “Domain value” match, this suggests that, despite the use of the composite domain, the individual domains are being used to align the cleansed data with the information in the knowledge base.
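To make the difference concrete, here’s a minimal Python sketch of the two behaviors.  This is only an illustration of the logic, not how DQS works internally; the sample pairs and the function names are made up for this example.

```python
# Hypothetical reference data standing in for the knowledge base.
KNOWN_PAIRS = {
    ("Chevrolet", "Camaro"),
    ("Ford", "Mustang"),
    ("Ford", "Focus"),
}
KNOWN_MAKES = {make for make, _ in KNOWN_PAIRS}
KNOWN_MODELS = {model for _, model in KNOWN_PAIRS}


def check_individually(make, model):
    """Roughly what I observed: each domain is matched on its own."""
    return make in KNOWN_MAKES and model in KNOWN_MODELS


def check_as_pair(make, model):
    """What I expected: the combination itself must be known."""
    return (make, model) in KNOWN_PAIRS


print(check_individually("Ford", "Camaro"))  # True
print(check_as_pair("Ford", "Camaro"))       # False
```

Ford is a known make and Camaro is a known model, so the first check passes even though the pair never appears in the reference data.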

Workarounds?

Ideally, what we’d see is the Ford Camaro entry pushed to the New tab since there is no such combination in the KB.  However, there are a few limited options to work around this.

First, you could create a separate field containing the entire make and model combination in your source data, and perform the Make + Model validation against the single field.  This is probably the most realistic workaround as it doesn’t require a lot of static rules.  However, it still means that you will likely need to reengineer the way you stage the data.  It’s a generally accepted practice to store data elements as atomic units, and building a Make + Model field limits your options or forces you to undo that same operation later in the ETL.
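As a rough sketch of this first workaround, the staging step below builds a single make-and-model value and checks it against a list of known combinations.  The sample values and helper names are assumptions for illustration; in practice the combined column would be built in your ETL and mapped to a single DQS domain.

```python
# Hypothetical list of valid combined values (one domain, one field).
KNOWN_MAKE_MODELS = {"Chevrolet Camaro", "Ford Mustang", "Ford Focus"}


def combine(make, model):
    """Build the single staged field from the two atomic fields."""
    return f"{make.strip()} {model.strip()}"


def split_rows(rows):
    """Separate rows into known combinations and new (unrecognized) ones."""
    known, new = [], []
    for make, model in rows:
        value = combine(make, model)
        if value in KNOWN_MAKE_MODELS:
            known.append(value)
        else:
            new.append(value)
    return known, new


known, new = split_rows([("Chevrolet", "Camaro"), ("Ford", "Camaro")])
print(known)  # ['Chevrolet Camaro']
print(new)    # ['Ford Camaro']
```

Note that the original make and model fields are still needed downstream, which is the re-engineering cost mentioned above.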

You also have the option to create rules against your composite domains to set if/then scenarios for data validation.  For example, you could create a rule that dictates that if the car is a Camaro, the make must be Chevrolet.  However, unless the cardinality of your data is very, very low, don’t do this.  Creating static rules to deal with data semantics is like pulling at a loose thread on a sweater: you’ll never find the end of it, and it’ll just make a mess in the meantime.
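To show why the rule-based approach scales poorly, here’s a small sketch of if/then rules expressed as a Python lookup.  The rule table and function name are hypothetical; the point is that every model needs its own entry, and anything without an entry slips through unchecked.

```python
# Hypothetical if/then rules: if the model is X, the make must be Y.
REQUIRED_MAKE_FOR_MODEL = {
    "Camaro": "Chevrolet",
    "Mustang": "Ford",
}


def violates_rule(make, model):
    """True only when a rule exists for the model and the make breaks it."""
    required = REQUIRED_MAKE_FOR_MODEL.get(model)
    return required is not None and make != required


print(violates_rule("Ford", "Camaro"))   # True: caught by the Camaro rule
print(violates_rule("Ford", "Corolla"))  # False: no rule exists for Corolla
```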

Resolution

I’d like to see this behavior fixed, as I think it will lead to confusion and a lot of extra work on the part of data quality and ETL professionals.  I’ve created a Connect bug report to address this behavior, and I’m hopeful that we’ll see a change in this behavior in a future update or service pack.  Feel free to add your vote or comments to the Connect item if you think the change I describe would be useful.

Conclusion

In this post, I’ve highlighted the unexpected behavior of composite domains in data cleansing operations, along with a few workarounds to help you get past this issue.  As always, comments and alternative suggestions are welcome!

In Data Quality Services, composite domains can be created to associate together two or more natural domains within a knowledge base.  Like natural domains, composite domains can also contain one or more validation rules to govern which domain values are valid.  In my last post, I discussed the use of validation rules against natural domains.  In this post, I’ll continue the thread by covering the essentials of composite domain rules and demonstrating how these can be used to create relationships between data domains.

What is a composite domain?

Before we break off into discussing the relationships between member domains of a composite domain, we’ll first touch on the essentials of the latter.

Simply, a composite domain is a wrapper for two or more organic domains in a knowledge base.  Think of a composite domain as a virtual collection of dissimilar yet related properties.  As best I can tell, the composite domain is not materialized in the DQS knowledge base, but is simply a meta wrapper pointing to the underlying values.

To demonstrate, I’ve created a knowledge base using a list of automobile makes and models, along with a few other properties (car type and seating capacity).  I should be able to derive a loose association between automobile type and seating capacity, so I’ll create a composite domain with those two domains as shown below.

image

As shown above, creating a composite domain requires nothing more than selecting two or more domains from an existing knowledge base.  After the composite domain has been created, your configuration options are generally limited to attaching the composite domain to a reference data provider (which I’ll cover in a future post) and adding composite domain rules.

Value association via composite domain rules

The most straightforward way to associate the values of a composite domain is to create one or more rules against that composite domain.  When created against a composite domain, you can use rules to declare if/then scenarios to describe allowable combinations therein.

Back in the day, before marriage, kids, and a mortgage, I used to drive sports cars.  Even though that was a long time ago, I do remember a few things about that type of automobile: they are fast, expensive to insure, and don’t have a lot of passenger capacity.  It’s on that last point that we’ll focus our data quality efforts for now.  I want to make sure that some sneaky manufacturer doesn’t falsely identify as a sports car some big and roomy 4-door sedan.  Therefore, I’m going to create a rule that will restrict the valid domain values for seating capacity for sports cars.

I’ll start with some business assumptions.  What’s the minimum number of seats a sports car should have?  I think it’s probably 2, but I suppose if some enterprising gearhead decided to convert an Indy Car into a street-legal machine, it would likely be classified as a sports car too.  Therefore, it would be reasonable to assume that, in an edge case, a sports car could have just a single seat, so our minimum seating capacity for a sports car would be 1.  On the high side, design of sports cars should dictate that there aren’t many seats.  For example, the Chevrolet Camaro I had in high school could seat 4 people, assuming that 2 of the people were small children with stunted growth who had no claustrophobic tendencies.  However, we can give a little on this issue and assume that they somehow manage to squeeze a third row of seats into a Dodge Magnum, so we’ll say that a sports car can have a maximum seating capacity of 6 people.

Now, with that information in hand, I’m going to use the Domain Management component of the DQS client to set up the new rule against the “Type and Capacity” composite domain from above.  As shown below, I can set value-specific constraints on the seating capacity based on the automobile type of Sports Car.

image

As shown, any valid record with a car type of Sports Car must have a seating capacity of between 1 and 6 persons.

Of course, sports cars aren’t the only types of automobiles (gasp!), so this approach would likely involve multiple rules.  Fortunately, composite domains allow for many such rules, which would permit the creation of additional restrictions for other automobile types.  You could also expand the Sports Car rule and add more values on the left side of the operator (the if side of the equation).  For example, you might call this instead a “Small Car rule” and include both sports cars and compact cars in this seating capacity restriction.
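The logic of the “Small Car rule” described above can be sketched in a few lines of Python.  This is just the if/then shape of the rule, not anything DQS generates; the type labels and function name are assumptions for the example.

```python
# Types on the "if" side of the rule (labels are assumed for illustration).
SMALL_CAR_TYPES = {"Sports Car", "Compact Car"}
MIN_SEATS, MAX_SEATS = 1, 6


def capacity_is_valid(car_type, seats):
    """If the type is a small car, seating must be between 1 and 6 inclusive."""
    if car_type in SMALL_CAR_TYPES:
        return MIN_SEATS <= seats <= MAX_SEATS
    # The rule says nothing about other types, so they pass this check.
    return True


print(capacity_is_valid("Sports Car", 2))  # True
print(capacity_is_valid("Sports Car", 8))  # False
print(capacity_is_valid("Minivan", 8))     # True
```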

Other uses

Although we’ve limited our exploration to simply interrogating the value of the natural domains within a composite domain, this is by no means our only option for validation.  For example, when dealing with string data you can inspect the length of the string, search for patterns, use regular expressions, and test for an empty string in addition to checking against the actual value.  Shown below are some of the options you can use to query against a string value in a domain rule.

image

When dealing with date or numerical data, you have the expected comparison operators including less than, greater than, less than or equal to, etc.
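For a feel of what these string checks amount to, here’s a rough Python equivalent of a length test, a regular expression test, and an empty-string test.  The function names and the sample “model code” pattern are made up, and I’m assuming the whole value has to match the expression; DQS may treat partial matches differently.

```python
import re


def has_length(value, expected):
    """Length check: the string is exactly the expected length."""
    return len(value) == expected


def matches_regex(value, pattern):
    """Regex check (assumption: the entire value must match)."""
    return re.fullmatch(pattern, value) is not None


def is_empty(value):
    """Empty-string check."""
    return value == ""


# Hypothetical model code: two capital letters followed by three digits.
MODEL_CODE = r"[A-Z]{2}\d{3}"

print(matches_regex("ZL100", MODEL_CODE))  # True
print(matches_regex("Z1", MODEL_CODE))     # False
print(has_length("ZL100", 5))              # True
print(is_empty(""))                        # True
```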

Conclusion

This post has briefly explored composite domains and shown how to add validation rules to a composite domain in an existing knowledge base.  In my next DQS post, I’ll continue with composite domains to illustrate a potential misunderstanding in the way composite domains treat value combinations in cleansing operations.

Posted by Tim Mitchell on 04 May 2012, 07:15

A compelling feature of the new Data Quality Services in SQL Server 2012 is the ability to apply rules to fields (domains) to describe what makes up a valid value.  In this brief post, I’d like to walk through the concepts of domain validation and demonstrate how this can be implemented in DQS.

Domain validation essentials

Let’s ponder domain validation by way of a concrete example.  Consider the concept of age: it’s typically expressed in discrete, non-negative whole numbers.  However, the expected values of the ages of things will vary greatly depending on the context.  An age of 10 years seems reasonable for a building, but sounds ridiculous when describing fossilized remains.  A date of “1/1/1950” is a valid date and would be appropriate for classifying a person’s date of birth, but would be out of context if describing when a server was last restarted.  In a nutshell, the purpose of domain validation is to allow context-specific rules to provide reasonableness checks on the data.

A typical first step in data validation would involve answering the following questions:

  • Is the data of the right type?  This helps us to eliminate values such as the number “purple” and the date “3.14159”.
  • Does the data have the right precision? This is similar to the point above: If I’m expecting to store the cost of goods at a retail store, I’m probably not going to configure the downstream elements to store a value of $100 million for a single item.
  • Is the data present where required?  When expressing address data, the first line of an address might be required while a second line could be optional.

Domain validation goes one step further by answering the question, “Is a given value valid when used in this context?”  It takes otherwise valid data and validates it to be sure it fits the scenario in play.
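Here’s a small Python sketch of that distinction, using the age example from earlier.  The first function is a plain type check; the second is the context-specific reasonableness check.  The age ranges are invented purely for illustration.

```python
def parse_age(text):
    """Type check: return a non-negative whole number, or None if invalid."""
    try:
        age = int(text)
    except ValueError:
        return None
    return age if age >= 0 else None


# Hypothetical reasonable age ranges (in years) for each context.
AGE_RANGES = {
    "building": (0, 500),
    "fossil": (10_000, 4_000_000_000),
}


def is_reasonable(age, context):
    """Domain validation: is this otherwise valid age sensible in context?"""
    low, high = AGE_RANGES[context]
    return low <= age <= high


print(parse_age("purple"))            # None
print(is_reasonable(10, "building"))  # True
print(is_reasonable(10, "fossil"))    # False
```

The value 10 passes the type check in both cases; only the context decides whether it is believable.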

Domain validation in DQS

Even if you don’t use this term to describe it, you’re probably already doing some sort of domain validation as part of your ETL or data maintenance routines.  Every well-designed ETL system has some measure of sanity checking to make sure data fits semantically as well as technically.

The downside to many of these domain validation scenarios is that they can be inconsistent and are usually decentralized.  Perhaps they are implemented at the outer layer of the ETL before data is passed downstream.  Maybe the rules are applied as stored procedures after they are loaded, or even as (yikes!) triggers on the destination tables.

Data Quality Services seeks to remedy the inconsistency and decentralization issue, as well as make the process easier, by way of domain validation rules.  When creating a domain in DQS, you are presented with the option of creating domain rules that govern what constitutes a valid value for that domain.  For the example below, I’m using data for automobile makes and models, and am implementing a domain rule to constrain the value for the number of doors for a given model.

image

With the rule created, I can apply one or more conditions to each of the rules.  As shown, I am going to constrain the valid values to lie between 1 and 9 inclusive, which should account for the smallest and largest automobile types (such as limousines and buses).

image

For this rule, I’m setting the conditions that the value must be greater than zero and less than ten.  Note that there is no requirement to use this bookend qualification process; you can specify a single qualifier (for example, greater than zero) or have multiple conditions strung together in the same rule.  You can even change the AND qualifier to an OR if the rule should be met when either condition is true, though I would caution you when mixing 3 or more conditions using both AND and OR, as the behavior may not yield what you might expect.
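The choice between AND and OR matters more than it might appear.  As a quick sketch in Python (the function names are mine, not anything from DQS), here is the doors rule for the range of 1 through 9 written both ways:

```python
def doors_rule_and(doors):
    """Both conditions must hold: 1 through 9 inclusive for whole numbers."""
    return doors > 0 and doors < 10


def doors_rule_or(doors):
    """Either condition may hold: every number satisfies at least one."""
    return doors > 0 or doors < 10


for doors in (0, 4, 12):
    print(doors, doors_rule_and(doors), doors_rule_or(doors))
# 0 False True
# 4 True True
# 12 False True
```

With OR, the two bookend conditions overlap so completely that nothing is ever rejected, so the rule needs AND to act as a range check.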

That’s all there is to creating a simple domain validation rule.  Remember that for the condition qualifiers, you can set greater than, less than, greater than/equal to, etc., for the inclusion rule when dealing with numerical or date domain data types.  For string data types, the number of options is even greater, as shown below:

image

Of particular interest here is that you can leverage regular expressions and patterns to look for partial or pattern matches within the string field.  You can also check the string value to see if it can be converted to numeric or date/time.
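The “can this string be converted” checks are easy to picture in code.  Here’s a minimal Python sketch; the function names and the month/day/year date format are my assumptions, and DQS may accept a different set of formats.

```python
from datetime import datetime


def is_numeric(text):
    """True if the string can be converted to a number."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def is_date(text, fmt="%m/%d/%Y"):
    """True if the string parses as a date in the assumed format."""
    try:
        datetime.strptime(text, fmt)
    except ValueError:
        return False
    return True


print(is_numeric("3.14159"))  # True
print(is_numeric("purple"))   # False
print(is_date("1/1/1950"))    # True
print(is_date("3.14159"))     # False
```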

The rule in action

With the new domain validation rule in place, let’s run some test data through it.  I’m going to create a few test records, some of which violate the rule we just created, and run them through a new DQS project using the knowledge base we modified with this rule.

I’ll start off with the dirty data as shown below.  You can probably infer that we’ve got a few rows that do not comply with the rule we created, on both ends of the value scale:

image

After creating a new data cleansing project, I use the data shown above to test the rule constraining the number of doors.  As shown below in the New output tab, we have several rows that comply with this new rule:

image

In addition, there are two distinct values found that do not meet the criteria specified in the new rule.  Selecting the Invalid tab, I see the values 0 and 12 have failed validation, as they fall outside the range specified by the rule.  In the Reason column, you can see that we get feedback indicating that our new rule is the reason that these records are marked as Invalid:

image

So by implementing this rule against my data, I am able to validate not only that the value is present and of the correct type, but that it is reasonable for this scenario.

Conclusion

In this post we’ve reviewed the essentials of domain validation and how we can implement these checks through domain rules in SQL Server Data Quality Services.  In my next post, I’ll continue the discussion around domain rules by reviewing how these rules can be applied to composite domains in DQS.

Posted by Tim Mitchell on 04 May 2012, 00:40

Last month I made the trip with some other Dallas-area speakers down to Houston’s second annual SQL Saturday event.  This was one of the shorter trips for me – we drove down on Friday afternoon/evening, and left out late Saturday.  It was good to be away from home for just one night (rather than the two nights I normally stay for SQL Saturday), but it definitely made for a long Saturday.

The facility for this event was not too bad.  It was some sort of educational institution for at-risk kids, and had fully equipped (although dated) video equipment.  One of my two sessions was the first one of the morning, and the room I was in had a bad projector in it.  A couple of guys from the school crew (they had folks there on Saturday for the event, which turned out to be quite useful) came and replaced the projector, with another bad one.  The third one finally worked, but it wouldn’t mount on the ceiling so they parked it on the desk.  That was a little awkward with space tight already, but we made it work.

Registration was very tough.  Houston was one of the first to try out the online check-in process on the SQL Saturday website, but unfortunately this process relies on attendees remembering their PASS website login.  Compounding the problem was the fact that they only had one computer for people to use for check-in.  I know that Nancy Wilson, the Houston group leader, has been in touch with PASS staff about a better way to handle check-ins without going back to paper.

Lunch was good.  Like last year, they catered barbecue for the event (in Houston, is there anything else?).  The vendors were all set up in the cafeteria where lunch was served, so it worked well to get some face time with attendees before and after they ate.

Like last year’s event, I thought it was well organized.  My only two suggestions would be:

  • Registration changes.  Hopefully we’ll get a web-based way to check in attendees without having them log into their PASS accounts at the point of check-in.
  • Timely notification to speakers of their selection.  Speakers were notified just a month ahead of time, which is an awfully short time frame since many (most?) of the speakers are coming from out of town and need to make travel arrangements.  It wasn’t as much of an issue for me since I didn’t fly, but I suspect that a few of the speakers probably got pinched on airfare due to the late notification.

Overall a good event!  I’ll definitely be back for next year.  By the way, I took a few pictures and posted them on my Flickr account.