Fish and Chips, Robin Hood, SSIS, and Me

I’m going to Nottingham, England, fabled home of Robin Hood, where I’m going to eat fish and chips and talk about SSIS.  How’s that for tying seemingly unrelated topics together?

I’m happy to report that in addition to my full-day preconference seminar at SQLBits, I’ll also be delivering a regular session entitled Cleaning Up Dirty Data with SSIS on Friday, May 3.  For those unfamiliar, SQLBits is the largest and most well-known SQL Server event outside of the US.  This year’s event will be held in Nottingham, England from May 2-4.  I’ll be speaking all day on Thursday, May 2nd, with the data cleansing session to follow on Friday.

Registration is still open, and until next week you can still get in on the early bird pricing.  If you make it to SQLBits, please do stop by and say hello.

Join me at SQLBits for a day of SSIS

I’m happy to announce that I’ve been selected to deliver a full day of training at the upcoming SQLBits conference in Nottingham, England.  This will be my first time taking part in SQLBits (in fact, it will be my first trip to the U.K.) and I’m very much looking forward to the experience.

I’ll be delivering a full day presentation on Thursday, May 2 entitled Real World SSIS: A Survival Guide.  Our mission for the day will be to examine challenges common to many ETL professionals and demonstrate how to overcome them.  From error handling to logging, and performance tuning to data cleansing, we’ll move beyond the basics to demonstrate real-world approaches to these challenges.  Along with the discussion, we’ll work through plenty of demos of these solutions.

Registration is now open for SQLBits.  You can find out about pricing here (including details about early bird deals), and register here.

I hope to see you there!

Want to see me break SSIS this Thursday?

OK, I’m not really going to break it.  I’m just going to beat it up a little bit.

This Thursday evening, I’ll be working my home turf at the North Texas SQL Server User Group monthly meeting, presenting “When ETL Goes Bad: Handling Errors and Anomalies in SSIS” at 6:15 at the Microsoft campus.  In this hour-long presentation, I’ll go through some common error and data anomaly scenarios and demonstrate some ways to error-proof your packages.

Here’s the full synopsis of my presentation:

Even in the most reliable ETL system, it’s eventually going to happen: A package will fail, an output row count won’t match what is expected, a data anomaly will force a forensic search of the package infrastructure… In this session, we’ll talk about some of the things that can go wrong during package execution – and some of the defenses that can be built into SSIS to help head off those errors (or at least make the errors easier to troubleshoot). From error handlers to package restartability, and from logging methods to row count validation and remediation, we’ll cover various ways to keep your ETL flowing when problems arise.

After my presentation, my friend Dave Stein kicks off a new series he’s doing on data warehousing.  Hope to see you there!

Ragged Flat File Processing in SSIS

Frequently I am asked how to properly use SQL Server Integration Services to handle ragged flat files, an example of which is shown below.  In this small sample file, you can see that we are dealing with medical data that has an inconsistent format.  As shown, the rows in the pipe-delimited file can have as few as seven and as many as eleven columns.

[Image: sample ragged pipe-delimited flat file]

Although you and I can look at this file and interpret its meaning, previous versions of SSIS cannot (see note below regarding SSIS in SQL Server 2012).  You can see that we’re missing not only the data but the delimiters as well, which is the root of the problem.  This type of format, where delimiters are simply left off when values at the end of the row are missing or NULL, is unfortunately a relatively common situation that ETL developers must address.

With a file like this, SSIS will continue to consume data for each field until it finds the specified field delimiter – even if that means that data is “gobbled up” from the next line in the file!  Let’s assume that the outputs on my Flat File Source component define 11 fields (the largest number of fields found in this file snippet).  When SSIS processes the first row of data, it’s going to get to the end of that row still expecting to find a pipe delimiter to indicate the end of that field.  Rather than treating the end of the line of data (represented by an unprintable newline character) as the end of the column, the Flat File Source component will continue to consume data until either it finds the specified delimiter or the length of the data consumed exceeds the width specified for that column.  In this example, the eighth field would not contain just the literal “453” at the end of the line in the file, but a newline character followed by the first value of the following line of data.  Then the next field, which should be NULL, would contain “Reichert”, followed by the incorrect value of “781”, and so forth.  As shown below, this slip would cascade through the remainder of the file.

[Image: the misalignment cascading through the remainder of the file]

In a case like this, the very best you can hope for is that the data length would overflow the allowed field width, or a data type conflict would cause an error and fail the package.  In a worst-case scenario, SSIS could actually consume the entire file with this misalignment as shown above, which may not raise an error in the package but would almost certainly cause downstream issues in the destination systems.
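To make the failure mode concrete, below is a minimal sketch of what happens when a parser keeps splitting on the pipe delimiter without treating the newline as a row boundary.  This is plain C# outside of SSIS, and the sample values are made up (aside from the “453”, “Reichert”, and “781” mentioned above); the Flat File Source isn’t literally calling Split, but the effect is the same.

// A simplified illustration of the misalignment (not SSIS itself).
// Two hypothetical lines: the first has 8 fields, the second has 7.
string fileText =
    "1001|Smith|42|2012-01-15|PRC01|Appendectomy|Complete|453\n" +
    "1002|Reichert|781|2012-01-16|PRC02|Tonsillectomy|Complete";

// Splitting the whole stream on the pipe alone (ignoring newlines) shifts
// everything after the first short row.
string[] misaligned = fileText.Split('|');
// misaligned[7] == "453\n1002"  <- the two rows have bled together
// misaligned[8] == "Reichert"   <- lands in a column that should be NULL
// misaligned[9] == "781"        <- and the slip cascades from here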

Scripting to the rescue!

Never fear – there is a solution to this problem.  By using just a bit of code in SSIS, we can easily get around this issue and process the ragged flat file.  The methodology we’ll use is as follows:

  • Using the Flat File Source, consume each line of text as one big string (rather than breaking it apart into individual fields)
  • Using the Script Component, parse each line of the file.  Dynamically populate the output fields with the data in each line of the file, and assign NULLs to those fields where no value is present.
  • Send the cleansed output to the destination table

Sounds easy enough, right?  Let’s go through this step by step.

Process each line of data

As mentioned, we’re going to trick our Flat File Source component into processing this file by forcing it to handle each line of data as if it were a single text field.  As shown below, I’ve configured a single output field, named [Column 0], as a field of type String with a width of 2000 characters.  This will include everything on the line of data, including the pipe delimiters, in the output.

[Screenshot: Flat File Source with a single 2000-character string output column, [Column 0]]

Next, we’ll add a script component into the data flow pane.  We’ll configure the input of the script component to use the single field generated by the flat file source above.

[Screenshot: Script Component input using the single [Column 0] field]

We also need to add all of the output fields manually, which we will populate from within the script.  As shown below, I’ve already added the eleven fields to the script component output, configuring each one with the appropriate data type and length.

[Screenshot: the eleven output columns defined on the Script Component]

Now onto the fun part: The Code.  Within the script, our methodology will be as follows:

  • Break apart the big text string into its individual data fields, splitting on the specified delimiter (in this case, the pipe symbol).
  • If there are mandatory fields (i.e., any fields which must be present for a valid row of data), assign these to the appropriate output columns.
  • For the optional output columns, process the values retrieved from the input string, in order, assigning each one to the next output column until no other input values exist.

First things first – let’s grab and shred the input text.  As shown below, I’m going to split the input on the pipe delimiter, resulting in an array of strings, each with a single value.

// Break apart the input string into its atomic elements
string[] allValues = Row.Column0.Split('|');

Next, we’ll work through all of the fields that are expected to be present in all valid rows of data.  In situations where there are mandatory fields which should always be present, no dynamic processing is required on those values, so we can simply assign them in order.  Note that I’ve used an incrementing counter variable to work through the array of input data – this is simply a shortcut to allow me to copy/paste the same line of code rather than hard-coding the index for each element.

// Counter
int i = 0;

// Address the values known to be present in every row
Row.MRN = allValues[i++];
Row.PatientLastName = allValues[i++];
Row.PhysicianID = allValues[i++];
Row.ProcedureDate = allValues[i++];
Row.ProcedureID = allValues[i++];
Row.ProcedureDescription = allValues[i++];
Row.ProcedureStatus = allValues[i++];

Now we move on to the dynamic portion of the processing.  From this point, all columns are optional, so we need to go through the remaining values, one by one, and assign them to the final four output columns (all related to risk factor codes in this example).  As shown in the code snippet below, I’m continuing to work through the string array created earlier to process all of the values.  As long as values remain in the array, we’ll continue assigning them to the next risk factor code in order; when we run out of values, the remaining risk factor columns will be set to NULL.

// The next four values are variable.  We may have zero to four of these, so
//  we’ll add them dynamically as they appear
Row.RiskFactor1 = allValues.Length > i ? allValues[i++] : null;
Row.RiskFactor2 = allValues.Length > i ? allValues[i++] : null;
Row.RiskFactor3 = allValues.Length > i ? allValues[i++] : null;
Row.RiskFactor4 = allValues.Length > i ? allValues[i] : null;

We only have four possible additional columns in this scenario, but this pattern can easily be applied to similar scenarios with more than just a few optional elements.
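If you’re facing a lot more optional columns, repeating the same ternary assignment for every column gets tedious.  One way to generalize it (just a sketch, reusing the hypothetical RiskFactor columns from above, and requiring the C# script component available in SSIS 2008 and later) is to collect the setters for the optional output columns in an array and loop over them:

// A sketch of generalizing the optional-column assignments: gather the
// setters for the optional output columns, then assign from the array in a loop.
Action<string>[] optionalSetters =
{
    v => Row.RiskFactor1 = v,
    v => Row.RiskFactor2 = v,
    v => Row.RiskFactor3 = v,
    v => Row.RiskFactor4 = v
};

foreach (Action<string> setter in optionalSetters)
{
    // Assign the next remaining value, or NULL if the input row ran short
    setter(allValues.Length > i ? allValues[i++] : null);
}

The behavior is the same as the explicit assignments above; the only difference is that supporting another optional column means adding one line to the array rather than another copy of the ternary expression.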

Consume the cleansed data

Finally, we just need to route the cleansed data out to our destination.  As shown below, we’re sending the processed data to a SQL Server table.

[Image: data flow sending the cleansed rows to a SQL Server destination]

SQL Server 2012 changes

As mentioned earlier, this is a design pattern that I recommend for versions of SSIS prior to 2012.  If you are using SSIS with SQL 2012, this workaround is unnecessary: the Flat File Source was updated in this latest version so that it allows easy processing of these types of ragged files.

Conclusion

Ragged flat file processing can be a pain when working with older versions of SQL Server Integration Services.  This post demonstrates how, with just a few lines of code, you can avoid issues with ragged files by dynamically “peeling off” values and assigning them to the appropriate outputs.

Note: The package and sample data file I used for this demonstration are attached here.

2012 in Review

With the year 2012 now in the rear view mirror, and while I’m still in the midst of my holiday vacation (two whole weeks! Feels very strange…), I thought I’d take stock of my tasks and accomplishments of the past year.

Sharing knowledge

I must sound like a broken record when I say this, but I feel incredibly honored that I get to spend so much time in the community sharing what I’ve learned.  Over the years I’ve had the opportunity to speak at dozens of different events, and each time it feels like a homecoming.  Getting to meet and know all of the great people I’ve encountered, and being allowed to share with them the humble bits of knowledge I’ve accumulated in my decade or so of working in this business, makes it feel less like work and more like a reward.  A big thanks to everyone in the SQL Server community!

In 2012, I got the opportunity to present 10 talks at six different events in four states:

  • SQL Saturday Houston (2 sessions)
  • SQL Rally in Dallas
  • SQL Saturday Oklahoma City
  • SQL Saturday Dallas – BI Edition (one session plus a full-day precon)
  • SQL Saturday Oregon (Portland)
  • SQL PASS Summit in Seattle (two general sessions plus a full-day precon)

Doing face-to-face presentations is still my favorite way of sharing knowledge, because it allows me direct and immediate feedback from the audience.

On that topic, the Dallas event was the first time I’d delivered a full day of content without a copresenter.  With an audience of about 30 people, I presented “Real World SSIS: A Survival Guide”, which was seven hours of realistic lessons I’ve learned in ten or so years of working in this field.  I really enjoyed the experience – the audience was very receptive, and seemed to get a lot out of our time together.  I’m planning to do this presentation again sometime in 2013.

Writing

I also finalized a two-year book project this year.  Along with SQL community rock stars Andy Leonard, Michelle Ufford, Jessica Moss, and Matt Masson, I wrote and published a book entitled SSIS Design Patterns.  When Andy and I first started talking about this book two years ago, I never imagined that it would be so much work – or so rewarding.  This was the first significant book project I had taken on (I wrote one chapter for the MVP Deep Dives v2 book in 2011, but that was a small contribution compared to the time investment in the design patterns book) and it was an enlightening experience.  Though it was a lot of work, I’m already thinking about my next book project.

I also blogged a little – very little.  When I look back at my blog archive, I’m disappointed in myself that I’ve so significantly reduced the number of technical blog posts I publish.  Where I used to write several blog posts a month (sometimes several per week), I’m now averaging less than one technical post a month.  There’s really no excuse for this – I’m good at writing, I enjoy it, and I just need to make it a priority.  Same thing with technical forum participation.  I used to participate a lot by answering questions in technical forums, which I’ve all but stopped now.  I don’t think it’s an accident that the amount of forum participation and the number of blog posts decreased at about the same rate – I got a lot of blog ideas from answering questions on the forums.  If I increase forum participation, I expect that I’ll have more than enough ideas and motivation to get back into the groove of blogging.

Learning

While I spent a lot of time investing in teaching others, I don’t think I did a great job of deepening my own knowledge this year.  Although there is an implicit self-learning component to any teaching endeavor (after all, the teacher should be knowledgeable enough about the topic that he/she isn’t stumped by trivial questions), I didn’t spend a lot of time purposefully learning new things just for the sake of knowledge.

And on to 2013

So it’s been a busy year, no doubt.  Moving on to 2013, I’ve got a few things planned and a few goals I want to meet:

  • Speak at an international conference.  I’ve submitted to speak at SQLBits in Nottingham, UK, and I hope to be able to present one or more topics at that event.  This is the best-known SQL conference outside of the United States, and has a reputation for treating speakers and attendees well.  (Shameless plug – there’s still time to vote for my sessions!  You can sign up for a free account and vote on the SQLBits website.)
  • Blog real technical content at least twice a month.  Of the 20 blog posts I published last year, at least half of them were nontechnical in nature (where I’m going, where I’ve been, etc.).  I’m going to plan for two technical blog posts per month.
  • Stay involved in social media.  I’m not aiming for specific metrics here, but this past year I throttled back how much time I spent on social media.  Though I don’t aim to keep my nose glued to Twitter, I believe social media is a key part of keeping in touch with what’s going on in the industry.
  • Define work boundaries.  I can’t tell you how many hours I’ve spent when I’ve been “sorta working.”  You know, that place where you’re trying to work but aren’t really concentrating due to distraction from kids/TV/life in general.  Multitasking has diminishing returns, so I’m going to do better at defining when I’m working and when I’m not.
  • Start writing my next book.  I’m planning at least two new books, one technical and one nontechnical, and I’m going to start at least one of them this year.
  • Continue speaking.  Other than SQLBits, I’ve not yet submitted to any conferences, but I’m planning to hit at least one event per quarter (hopefully more).

I Need Your Vote!

Do your civic duty, and vote today! 

Voting is now open for SQLBits, to be held in Nottingham, UK this May.  This event is probably the best-known SQL Server event outside of the US, and unlike most other such events, they use a popular-vote model for general session selection.  I’ve submitted three general sessions.

Registration for this event is still open, as is voting for the general sessions (you’ll have to log in to vote).  If I’m selected for SQLBits, this will be my first experience with this event I’ve heard quite a bit about.  Hope to see some of you there!

The Great Blog Move of 2013

For at least a year, I’ve been considering changing blog platforms.  I set up my blog on TimMitchell.net using BlogEngine.NET several years back, a decision I made based primarily on the fact that this tool was built on ASP.NET and SQL Server as opposed to the LAMP platform used by WordPress, the real behemoth in the blogging world.  I told myself that I’d be more likely to customize my blog if it was written in languages and tools I use every day as opposed to open source technologies that, while fully capable, aren’t in my wheelhouse.

However, it didn’t quite work out that way.  While the idea of pimping my blog was (and still is) appealing, the fact is that I don’t have the spare cycles to take on a bunch of custom web development work.  And while I’m not disappointed in the native offerings of the BlogEngine.NET platform, I quickly discovered that WordPress led the way in features and compatibility.  Therefore, I decided that eventually I would like to move to WordPress.

Easier said than done

Who knew that it would be so much trouble to move blog platforms?  One would think that, since we’re just talking about XML data, it should be easy enough to port from one platform to another.  And it probably would have been, were I more well versed in the ways of XML.  However, since I’m not, I went out to find some tool or utility to allow me to move all of my content to WordPress without having to do so manually.  I found a few vague blog posts about how one *might* do this, but apart from doing a lot of manual data manipulation, there didn’t seem to be a silver bullet for solving this problem. 

After some searching, I did find a CodePlex project named Blog Migrator that purported to handle migration from one platform to another.  Although it didn’t work as coded – all Post ID fields were treated as integers in code, while BlogEngine.NET uses a character field – I was able to tweak the code to get the posts loaded from my BlogEngine site (via the export file in BlogML format) to my new WordPress site.  Sadly, I couldn’t get the comments to transfer automagically, so I had to do some ETL to get them pushed over to WordPress.  My blog categories also didn’t transfer, but I’m not sweating that now – I may go back later and migrate these (or just recategorize everything if I get really bored).

New Host

As part of this move, I also changed web hosts.  I have been using SectorLink for several years, and apart from some recent issues with reliability and customer service, it has been a good web host for me.  However, in moving to WordPress, I wanted to find a host that focused on that platform rather than a generic do-it-yourself site.  I considered going directly through WordPress.com (and actually experimented with a free site there) but found that it did not offer a lot of flexibility in terms of adding new features.  I heard good things about DreamHost, a service targeting WordPress users.  I set up an account there, and found that they offered the structured system I wanted while also allowing for maximum flexibility (for example, I can add new plug-ins to my DreamHost site, which I could not do when using a WordPress.com hosted site).

It’s still very early in the relationship, and I haven’t yet had a chance to experience some aspects of DreamHost’s offerings (specifically, I’ve not yet had to contact customer service, an area where my last web host did not excel).  But so far I’ve been satisfied with the features and documentation that I’ve found.

Moving forward

I’ve still got a few things to check on – for example, some of my older posts used a different image path than the newer ones, so I need to confirm that all of my images are showing up correctly.  Since I’m syndicated on SQLServerCentral.com, I also need to confirm that syndication is working correctly.  If the worst happens and DreamHost doesn’t work out, I’ve still got my old site intact and can revert to that content in a matter of hours (as long as it takes for DNS to update).  Hopefully, though, I’ll find that WordPress and my new host meet all of my needs.

I don’t really care for the WordPress theme I’m using, but it was easy to use and didn’t remind me of anyone else’s WP theme.  It’s on my list of things to change, but I’m a function-before-form type so I’ll add this lower on my to-do list.

And of course, if either one of you loyal readers finds anything out of place due to the move, please let me know (I have a handy contact form just for that).