Webinar: Scripting and SSIS

Tomorrow at 10am (11am EDT), I’ll be joining together with my good friend and SSIS Design Patterns coauthor Andy Leonard for a free one hour webinar to discuss scripting in SQL Server Integration Services:

Join SQL Server MVP Tim Mitchell and Andy Leonard as they discuss and demonstrate scripting in SSIS! In this demo-packed session, two co-authors of the book SSIS Design Patterns share their experience using the Script Task and Script Component to accomplish difficult transformations and improve data integration.

You can register online here.  We look forward to seeing you tomorrow!

Colorado speaking tour

I’m happy to announce that I’ll be doing a tour of the 4 SQL Server user groups in central Colorado next week.  I’ll be speaking at these four user groups:

Monday (3/18): Northern Colorado Database Professionals

Tuesday (3/19): Boulder SQL Server User Group

Wednesday (3/20): Colorado Springs SQL Server User Group

Thursday (3/21): Denver SQL Server User Group

At each of these user group meetings, I’ll be delivering a session entitled “When ETL Goes Bad: Handling Errors and Anomalies in SSIS”.  This is the second such tour that I’ve done: two years ago I made trip up there and spoke at 3 of these user groups.  It’s always a pleasure to visit this area – the Colorado SQL community has a lot of great people, and I’ve got several friends in the area that I hope to see.

This will be a working trip, too.  I’ve got a client in the Denver area that I’ll be working with during the week, so I expect it to be a very packed week.  I’m taking the day off on Friday and am staying over until Sunday morning, so I should have two full days to squeeze in some skiing while I’m there.

If you’re in the area, please stop by one of the user group meetings and say hello.  And if you’re up for some skiing on Friday or Saturday, let me know Smile.

Business Intelligence Projects now supported in Visual Studio 2012

Since the release of Visual Studio 2012, business intelligence developers have been limited in how much they could use this tool due to the fact that it did not support BI project types (SSIS, SSAS, and SSRS).  Today, that limitation is now gone with the release by Microsoft of SQL Server Data Tools – Business Intelligence for Visual Studio 2012.  With this release, BI professionals may now fully move onto Visual Studio 2012 for SQL Server 2012 project initiatives.

You can read more about this release on the SSRS team blog and SSAS team blog, or download the code from the Microsoft website.

Using Object Typed Variables in SSIS

Note: This will be the first post in a short series on using Object typed variables in SQL Server Integration Services.

When defining variables in SSIS, the ETL developer has several data type options to choose from depending on the information to be stored in each variable.  Included in the options are String, Boolean, Char, DateTime, and several flavors and sizes of Int.  However, there’s another variable data type that is very handy but also underutilized: the Object data type.

fig1

Why Object Variables?

Variables with a data type of Object are the most flexible variables in SSIS.  Data types such as Int, String, and DateType are designed to store just one type of data; however, an Object typed SSIS variable can store almost any type of data, even information that can’t otherwise be represented in Integration Services.  In most cases, SSIS doesn’t even have to be configured to know what type of data you’re storing in an Object typed variable – usually, it can simply pass the value along the wire as a bunch of bits without knowing or caring what’s in there.

When considering the use of Object typed variables in SSIS, I’ll give the same disclaimer that I give for using script tasks/components in SSIS: Just because you can doesn’t mean you should.  If a native data type will work to store any possible value for a particular variable, by all means, don’t complicate your code by adding unnecessary moving parts.  Use Object typed variables only when a native type won’t do – otherwise, stick to the well-worn path.

Common Uses

There are a few cases that come to mind that lend themselves to using Object typed variables in SSIS:

  • Iterating over a result set.  This is probably the most common and well-documented use of Object variable, as well as the easiest to implement as it requires no manual coding.  This pattern will allow you to retrieve a set of data from a database, and then perform some operation for each row in that result set.  I’ll work through the mechanics of how to do this in my next post.
  • Handling binary data.  A common example of this is extracting binary data from or writing binary data to a VARBINARY field in a relational database.  If you need an interim storage mechanism in which this binary data should be stored, and Object variable can be a good solution.  In this case, the ETL pipeline doesn’t necessarily need to be aware of what is stored in the Object typed variable .
  • Creating or consuming binary data in SSIS.  Let’s say you need to either generate or process binary data as part of your ETL.  For example, you might need to retrieve a JPEG or PNG file from the file system and write it into a relational database, or retrieve a binary object from a database and process the various elements of that object.  By storing said data in an Object typed variable, you can directly write to or read from this variable within your code.

Risks and challenges

Naturally, with a construct as flexible as an Object typed variables, there are a few challenges to be aware of when considering when and how to use objects in your SSIS packages.  Among the risks:

  • Some coding required.  In many cases, especially when you’re processing the information contained in the Object typed variable (as opposed to simply passing the value through from a source to a destination), you’re going to have to write some code to address that object.
  • SSIS expressions not allowed.  Because they are designed to store a variety of information structures, Object typed variables cannot be used in an SSIS expression.  Even if the underlying data stored in the variable is of a type that could be stored in a native SSIS type, attempting to add an Object typed variable to an SSIS expression will throw an error.
  • Debugging challenges. If you overflow an Int32, or happen to truncate a string in SSIS, you’ll get a (mostly) friendly error message telling you what went wrong.  Often, when dealing with Object typed variables, you don’t get that luxury in SSIS.

Conclusion

Object typed variables in SQL Server Integration Services allow a great deal of flexibility when dealing with atypical data structures in the ETL cycle.  Although they are not as commonly used as simple native types, Object typed variables can make otherwise difficult ETL tasks easier.

In my experience, I’ve found that Object typed variables are sometime avoided in SSIS packages simply because they’re misunderstood or believed to be too difficult to use.  In the next few posts in this series, I’ll illustrate how the Object variable can be leveraged in your SSIS package by demonstrating a few use cases where Object typed variables in SSIS are appropriate.

Dust Off That Resume

Since I started regularly attending SQL Saturday events some five years ago, I’ve sat in on a number of professional development sessions by Andy Warren, Buck Woody, Don Gabor, and others.  Each one offered different bits of advice based on his or her own experience, but there was an overriding theme in all of them: Don’t wait until you need a job to start grooming yourself as a candidate.  Start building your network right now, they would all advise, regardless of your current employment status.  Push yourself to learn, especially where you see a shortage of skilled workers.  Stay visible, stay relevant.

But what about that resume?  After all, the resume is just a very small piece of the big picture… a document that be easily thrown together as soon as you need it – right?  (Note: If you nodded after that last sentence, please, keep reading.)

2585643891_dc3e1b8c4c_n“Tell me about yourself…”

Writing an effective resume isn’t easy.  Most people think writing about themselves is easy until they actually go about doing it.  To describe oneself in a way that is flattering but not overly boastful, colorful enough to be interesting yet still truthful, while keeping the description to one or two pages at most, takes a great deal of time and concentration. Sadly, I see some resumes that appear to be an afterthought – just a means to an end, without much planning or proofreading involved.

Resumes that were thrown together at the last minute have several telltale signs:

  • They enumerate every piece of software or hardware you ever touched, without describing how you used said hardware or software to solve actual problems.
  • They are full of filler phrases like “dynamic”, “uniquely qualified”, “fast learner”, “track record”, and “progressive”.
  • They contain too many errors in grammar or spelling.  (How many is too many? Any number greater than zero.)
  • After I read the whole resume, I still have no idea who you are or what you can do for the company.

My friend Steve Jones delivered a professional development presentation some time back in which he recommended that everyone touch their resume at least once per quarter, regardless of whether they were actually looking for a new job.  I believed in that advice so strongly that I’ve repeated it numerous times since.  However, like an out-of-shape cardiologist, I’ve been quite adept at ignoring my own advice.  When I recently needed a current copy of my resume for a training initiative, I discovered that I had not updated this document in over three years.  I succumbed to the thought that “I’ve got a good job, I’m not looking to make a move, so it can wait” and let the information go stale.

5217079666_076cdc469a_mBut I’m not looking for a job…

Is keeping your resume up to date really necessary, unless you are (or expect to soon be) looking to make a career move?  I submit that it is important, for several reasons:

  • A properly written resume takes time to create.  Don’t allow yourself to be sucked into thinking that you can spent an hour or two to create a superb resume.  At a minimum, you’re going to need several days to get it right.  A resume isn’t ready to be sent to a prospective employer until you’ve gone over it, word by word, to make sure it’s perfect.  Write your resume, put it down for a few days, and come back and reread it to be sure it really tells a story.  You should engage others as well – get as much feedback as possible before you finalize it.  These things take time!
  • You might not be looking for a job today, but you might be tomorrow.  Let’s face it – for those of us in the ranks of full-time employment, we’re just one really bad day away from joblessness.  Anyone who works for someone else could, on any given day, find himself out of work due to a high-profile error, an unforeseen downturn in business, a personality conflict, or for no reason at all (in many states).  If you find yourself suddenly and unexpectedly looking for a job, you shouldn’t let a stale resume slow down your job search.
  • Your career changes faster than you think, and it’s easy to lose track of those changes.  During the three years that I ignored my resume, I had contributed to two books, was elected to the board of my local user group, received the Microsoft MVP award three consecutive years, and learned several new technologies.  What I thought would be an easy task of documenting three years worth of career changes turned out to be much more work than I expected.  Especially in high-tech fields such as ours, careers can evolve quickly, and an up-to-date resume should reflect those changes.
  • You occasionally need an up-to-date resume for reasons other than getting a job.  At a previous job, I was asked on a few occasions to provide a copy of my resume for the benefit of potential clients of my employer – these prospects wanted to know the kind of people they’d be working with in case my employer was selected as their service provider.  Further, some extracurricular activities (community board service, authorship opportunities, etc.) require the candidate to produce a current resume.

Conclusion

Keeping your resume up to date takes time, and it’s even harder to motivate yourself to keep current if you’re not looking for a job.  But in the same way you continue learning and networking while not actively shopping for a new position, it’s beneficial to keep your resume polished and ready to go.

Side note: If you’re a senior BI professional and are looking for a new challenge, why don’t you send me a copy of that freshly-updated resume?  My employer is hiring, and it’s a great place to work!

Erin Welker joins Artis Consulting

Earlier this week, my employer, Artis Consulting, welcomed a new addition to our team.  Technical expert Erin Welker (b|t), a very well known and experienced business intelligence professional, has joined the BI practice here at Artis.  Erin brings a great deal to the table – she has been a SQL Server BI consultant for a number of years, has served the community in various capacities (including serving on the PASS board of directors), is a board member for the North Texas SQL Server User Group, and is a past recipient of the SQL Server MVP award. 

The BI practice here at Artis has been the strongest team I’ve ever worked with, and I’m very excited to have my friend Erin joining us here.  If you know Erin, be sure to give her a shout of congratulations.

Fish and Chips, Robin Hood, SSIS, and Me

I’m going to Nottingham, England, fabled home of Robin Hood, where I’m going to eat fish and chips and talk about SSIS.  How’s that for tying seemingly unrelated topics together?

I’m happy to report that in addition to my full-day preconference seminar at SQLBits, I’ll also be delivering a regular session entitled Cleaning Up Dirty Data with SSIS on Friday, May 3.  For those unfamiliar, SQLBits is the largest and most well-known SQL Server event outside of the US.  This year’s event will be held in Nottingham, England from May 2-4.  I’ll be speaking all day on Thursday, May 2nd, with the data cleansing session to follow on Friday.

Registration is still open, and until next week you can still get in on the early bird pricing.  If you make it to SQLBits, please do stop by and say hello.

Join me at SQLBits for a day of SSIS

SQLBitsLogoI’m happy to announce that I’ve been selected to deliver a full day of training at the upcoming SQLBits conference in Nottingham, England.  This will be my first time taking part in SQLBits (in fact, it will be my first trip to the U.K.) and I’m very much looking forward to the experience.

I’ll be delivering a full day presentation on Thursday, May 2 entitled Real World SSIS: A Survival Guide.  Our mission for the day will be to examine challenges common to many ETL professionals and demonstrate how to overcome them.  From error handling to logging, and performance tuning to data cleansing, we’ll move beyond the basics to demonstrate real-world approaches to these challenges.  Along with the discussion, we’ve got lots of demos to work through these solutions.

Registration is now open for SQLBits.  You can find out about pricing here (including details about early bird deals), and register here.

I hope to see you there!

Want to see me break SSIS this Thursday?

OK, I’m not really going to break it.  I’m just going to beat it up a little bit.

This Thursday evening, I’ll be working my home turf at the North Texas SQL Server User Group monthly meeting, presenting “When ETL Goes Bad: Handling Errors and Anomalies in SSIS” at 6:15 at the Microsoft campus.  In this hour-long presentation, I’ll go through some common error and data anomaly scenarios and demonstrate some ways to error-proof your packages.

Here’s the full synopsis of my presentation:

Even in the most reliable ETL system, it’s eventually going to happen: A package will fail, an output row count won’t match what is expected, a data anomaly will force a forensic search of the package infrastructure… In this session, we’ll talk about some of the things that can go wrong during package execution – and some of the defenses that can be built into SSIS to help head off those errors (or at least make the errors easier to troubleshoot). From error handlers to package restartability, and from logging methods to row count validation and remediation, we’ll cover various ways to keep your ETL flowing when problems arise.

After my presentation, my friend Dave Stein kicks off a new series he’s going on data warehousing.  Hope to see you there!

Ragged Flat File Processing in SSIS

Frequently I am asked how to properly use SQL Server Integration Services to handle ragged flat files, an example of which is shown below.  In this small sample file, you can see that we are dealing with medical data that has an inconsistent format.  As shown, the rows in the pipe-delimited file can have as few as seven and as many as eleven columns.

image

Although you and I can look at this file and interpret its meaning, previous versions of SSIS cannot (see note below regarding SSIS in SQL Server 2012).  You can see that we’re not only missing the data but the delimiters as well, which is the root of the problem.  This type of format where delimiters are simply left off when values at the end of the row are missing or NULL is, unfortunately, a relative common event that ETL developers must address.

With a file like this, SSIS will continue to consume data for each field until it find the specified field delimiter – even if that means that the data is “gobbled up” from the next line in the file!  Let’s assume that the outputs on my Flat File Source component define 11 fields (the largest number of fields found in this file snippet).  When SSIS processes the first row of data, it’s going to get to the end of that row still expecting to find a pipe delimiter to indicate the end of that field.  Rather than assuming the end of the line of data (represented by an unprintable newline character) is actually the end of the column, the Flat File Source component will continue to consume data until either it finds the specified delimiter or the length of the data consumed exceeds the width specified for that column.  In this example, the eighth field would not be the literal “453” at the end of the line in the file, but would contain a newline character followed by the first value of the following line of data.  Then the next field, which should be NULL, would contain “Reichert”, followed by the incorrect value of “781” and so forth.  As shown below, this slip would cascade through the remainder of the file.

file

In a case like this, the very best you can hope for is that the data length would overflow the allowed field width, or a data type conflict would cause an error and fail the package.  In a worst-case scenario, SSIS could actually consume the entire file with this misalignment as shown above, which may not raise an error in the package but would almost certainly cause downstream issues in the destination systems.

Scripting to the rescue!

Never fear – there is a solution to this problem.  By using just a bit of code in SSIS, we can easily get around this issue and process the ragged flat file.  The methodology we’ll use is as follows:

  • Using the Flat File Source, consume each line of text as one big string (rather than breaking it apart into individual fields)
  • Using the Script Component, parse each line of the file.  Dynamically populate the output fields with the data in each line of the file, and assign NULLs to those fields where no value is present.
  • Send the cleansed output to the destination table

Sounds easy enough, right?  Let’s go through this step by step.

Process each line of data

As mentioned, we’re going to trick our Flat File Source component into processing this file by forcing it to handle each line of data as if it were a single text field.  As shown below, I’ve configured a single output field, named [Column 0], as a field of type String with a width of 2000 characters.  This will include everything on the line of data, including the pipe delimiters, in the output.

SNAGHTML10b3b6d7

Next, we’ll add a script component into the data flow pane.  We’ll configure the input of the script component to use the single field generated by the flat file source above.

SNAGHTML10c1bb98

We also need to add all of the output fields manually, which we will populate from within the script.  As shown below, I’ve already added the eleven fields to the script component output, configuring each one with the appropriate data type and length.

SNAGHTML10c442c2

Now onto the fun part: The Code.  Within the script, our methodology will be as follows:

  • Break apart the big text string into its individual data fields, splitting on the specified delimiter (in this case, the pipe symbol).
  • If there are mandatory fields (i.e., any fields which must be present for a valid row of data), assign these to the appropriate output columns.
  • For the optional output columns, process the values retrieved from the input string, in order, assigning each one to the next output column until no other input values exist.

First things first – let’s grab and shred the input text.  As shown below, I’m going to split the input on the pipe delimiter, resulting in an array of strings, each with a single value.

// Break apart the input string into its atomic elements
string[] allValues = Row.Column0.Split(‘|’);

Next, we’ll work through all of the fields that are expected to be present in all valid rows of data.  In situations where there are mandatory fields which should always be present, no dynamic processing is required on those values, so we can simply assign them in order.  Note that I’ve used an incrementing counter variable to work through the array of input data – this is simply a shortcut step to allow me to copy/paste the same code rather than typing the cardinal numbers for each element.

// Counter
int i = 0;// Address the values known to be present in every row
Row.MRN = allValues[i++];
Row.PatientLastName = allValues[i++];
Row.PhysicianID = allValues[i++];
Row.ProcedureDate = allValues[i++];
Row.ProcedureID = allValues[i++];
Row.ProcedureDescription = allValues[i++];
Row.ProcedureStatus = allValues[i++];

Now we move on to the dynamic portion of the processing.  From this point, all columns are optional, so we need to go through the remaining values, one by one, and assign them to the final four output columns (all related to risk factor codes in this example).  As shown in the code snippet below, I’m continuing to work through the string array created earlier process all of the values.  As long as values remain in the array, we’ll continue assigning those values to the next risk factor code in order; when we run out of values, all remaining risk factor columns will be populated with a NULL string.

// The next four values are variable.  We may have zero to four of these, so
//  we’ll add them dynamically as they appear
Row.RiskFactor1 = allValues.Length > i ? allValues[i++] : null;
Row.RiskFactor2 = allValues.Length > i ? allValues[i++] : null;
Row.RiskFactor3 = allValues.Length > i ? allValues[i++] : null;
Row.RiskFactor4 = allValues.Length > i ? allValues[i] : null;

We only have four possible additional columns in this scenario, but this pattern can easily be applied to similar scenarios with more than just a few optional elements.

Consume the cleansed data

Finally, we just need to route the cleansed data out to our destination.  As shown below, we’re sending the processed data to a SQL Server table.

image

SQL Server 2012 changes

As mentioned earlier, this is a design pattern that I recommend for versions of SSIS prior to 2012.  If you are using SSIS with SQL 2012, this workaround is unnecessary: the Flat File Source was updated in this latest version so that it allows easy processing of these types of ragged files.

Conclusion

Ragged flat file processing can be a pain when working with older versions of SQL Server Integration Services.  This post demonstrates how, with just a few lines of code, you can avoid issues with ragged files by dynamically “peeling off” values and assigning them to the appropriate outputs.

Note: The package and sample data file I used for this demonstration are attached here.