Eliminating Empty Output Files in SSIS

So you’ve got some packages that regularly extract data to one or more text files, but you know that from time to time some of the queries won’t return any data. In SSIS, a flat file export package creates the output file regardless of whether any rows are written to it, so in cases where the source returns no rows, you’ll end up with empty data files in your output directory. Although not a critical failure, empty output files can be a nuisance, and unfortunately there are no native settings in the Flat File Destination or the Flat File Connection Manager that will clean them up. Fortunately, SSIS does provide a combination of other tools that you can use to emulate this behavior.

In our example, I’m going to create a package that extracts data from a table using a query for which I know no rows will be returned. When I connect this data source to a flat file destination and execute the package, I’ll see an empty data file in my export directory. Next, to demonstrate the intended behavior, I’ll add a Row Count transformation to store the number of rows processed in a variable, and create a File System Task to delete the output file. Finally, I’ll use an expression on the precedence constraint so that the file is deleted only when the row count variable is zero.

First, let’s create a test table to query:

USE testdb
GO

CREATE TABLE invoices (
    invoiceid INT IDENTITY(1, 1),
    vendorid INT,
    invoicedate DATETIME,
    invoiceamount DECIMAL(10, 2))

INSERT invoices (vendorid, invoicedate, invoiceamount)
VALUES (12, '8/3/2009', 4125.50),
       (53, '8/13/2009', 1095.25),
       (46, '8/15/2009', 729.50),
       (33, '8/23/2009', 3421.50)

Now, I’ll create a basic package that will export to text the invoices for the past 30 days, a reasonable business requirement. Since we don’t currently have any invoices matching that criterion, we’ll end up with an empty output file.
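As a sketch, the source query behind the package might look something like the following (the table comes from the script above; the 30-day window is my reading of the requirement):

SELECT invoiceid,
       vendorid,
       invoicedate,
       invoiceamount
FROM   invoices
WHERE  invoicedate >= DATEADD(DAY, -30, GETDATE()) -- no rows qualify with the sample data above

The original package is shown below: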

[Figure: original data flow, with an OLE DB Source connected to a Flat File Destination]

 

Now for the new-and-improved version: I’m going to drop a Row Count transformation into the data flow to save the number of rows processed to a variable:

 

[Figure: data flow with the Row Count transformation added]

 

Finally, I’ll create a File System Task and configure it to delete the output file. To ensure that a valid data file is not deleted, I’ll create a precedence constraint that uses an expression to execute the delete only if the row count variable is equal to 0:

 

[Figure: precedence constraint configured with an expression]
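The expression itself is a one-liner in the SSIS expression language. Assuming the Row Count transformation writes to a package variable I’ve named RowCount (the name is mine for illustration; use whatever your package defines), the constraint expression would read:

@[User::RowCount] == 0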

When you execute this package, you’ll see that the File System Task object is executed because there are no rows matching our query. You can test the package by inserting another row into the database that will be returned by the query, and you’ll see that the data file is exported but not deleted.
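For example, a test row with a current invoice date will fall inside the 30-day window (the values here are arbitrary):

INSERT invoices (vendorid, invoicedate, invoiceamount)
VALUES (12, GETDATE(), 100.00) -- arbitrary test values; any invoicedate in the last 30 days will do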

I’ve attached the before and after SSIS packages if you’d like to take them for a test drive. Enjoy!

ssis_eliminate_empty_output_files.zip (12.62 kb)

Space Sensitivity in SSIS Lookups

The case sensitivity of comparisons in the SSIS lookup transformation has been well documented through myriad blogs and forum posts (a good review can be found here). In a nutshell, a comparison using the lookup transformation is case sensitive when using the default setting of Full Cache, even if the values in the database are stored in a case-insensitive collation. This happens because, in Full Cache mode, the comparison is done by the SSIS engine rather than the database engine, and the former differentiates between “Value”, “VALUE”, and “VaLuE” regardless of the collation setting.

But did you know that this transformation is space sensitive as well? Consider the following T-SQL code:

SELECT CAST('char value' AS CHAR(20)) [val1]
INTO   test1

SELECT CAST('char value' AS CHAR(40)) [val2]
INTO   test2

SELECT t1.val1,
       t2.val2
FROM   test1 t1
       INNER JOIN test2 t2
         ON t1.val1 = t2.val2

As you would expect, executing this code results in a successful match on the INNER JOIN, even though we’re comparing CHAR values of differing lengths (see this article for more about spaces and padding in SQL Server).

However, when the same comparison is run through an SSIS lookup transformation in Full Cache mode, the lookup on our sample data will fail. As with the case-sensitive lookup, you’ll find that the SSIS engine treats the strings ‘Hello World’ and ‘Hello World  ’ (note the trailing spaces) as dissimilar values. Unlike in SQL Server, trailing whitespace is significant in SSIS value comparisons.

As a workaround, you can use the TRIM() function in SSIS or the RTRIM() T-SQL function to ensure that your comparisons ignore trailing whitespace. Alternatively, you could use a cache mode other than Full Cache, but you should be aware of the other implications before making such a change.
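For example, trimming on the database side is a one-line change to the lookup’s source query (using the test tables created above):

SELECT RTRIM(val2) [val2] -- trailing spaces removed before the value reaches the SSIS cache
FROM   test2

On the SSIS side, a Derived Column transformation with an expression such as TRIM(val1) accomplishes the same thing for the incoming pipeline column.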

Note that this behavior is limited to fixed-length character fields, but it can lead to some unexpected and hard-to-detect problems if you aren’t aware of it.

The Netbook: Upgrade to Windows 7

I’m more than 2 months into the netbook experience, and I’m happy to report that it’s still a good investment.  Last weekend, I made the leap to Windows 7 RTM.  So far so good!

I was more than a little surprised to find out that there is no upgrade path from Windows XP to Windows 7.  Although there are a few free utilities that can help ease the pain by migrating user settings and documents, a move from XP to 7 involves a completely new install.  That said, I’m usually not a fan of an in-place OS upgrade anyway, since these tend to be less stable than a clean install in the long run.  Fortunately, I haven’t had this machine long, so backing up all of my data for the reinstall didn’t take much time.

The installation of the OS and reinstall of my apps took a good part of the afternoon.  Moving to an operating system 8 years newer, I expected a performance drag due to the additional overhead, but I was pleasantly surprised at the snappiness of Windows 7 on the netbook.  I’ve noticed that I’m under a bit more memory pressure than I was with XP, but I’ve still got the standard 1 GB of RAM installed (I can go up to 2 GB on this model).  The battery life seems to have diminished a bit, but after reinstalling the Toshiba power management software, it’s still in the neighborhood of 7 to 8 hours.

I’ve noticed that if I have SQL Server, SQL Server Management Studio and Visual Studio running at the same time, the CPU tends to run higher than normal.  I have a little more disk paging than before, most likely due to the memory pressure.  Overall, though, the jump to Windows 7 has been a positive move for me.

More Free Training Resources

A recent blog post by Jack Corbett (No Training Budget Still No Excuse, followed up by SSC’s Steve Jones here) hit the nail on the head with respect to taking charge of one’s own career.  I won’t belabor the point by repeating their words here, but the message is that everyone has access to quality SQL Server training, even if your training budget is a goose egg.

Jack’s article pointed out a number of organizations and websites that offer free SQL Server educational resources, and I was pleased to find one asset on his list that I had never used.  I’ll add on a few Microsoft BI resources that I’ve discovered in my travels:

http://www.learnmicrosoftbi.com – A website with dozens of videos on Microsoft BI, mostly on SQL Server Analysis Services

http://www.sqldts.com/ and http://www.sqlis.com/ – Sister websites with resources for SQL Server DTS and SSIS, respectively

Pragmatic Works webinars – Brian Knight and company have published several on-demand webinars for business intelligence professionals

SQL Lunch – This event is just emerging, but will start in September and offer semi-weekly lunchtime webcasts on a variety of topics (follow Patrick LeBlanc’s blog for more details on this one).

MSSQLTips.com – A SQL Server resource site, which, among other offerings, sends out a SQL Server Tip Of The Day each weekday via e-mail.

Microsoft Podcasts – Not just for SQL Server, there are Microsoft podcasts and other resources for almost every Microsoft technology.  If you spend any time commuting, burn some of these to CD and learn on the road!

The Midnight DBA – According to their own description, husband and wife DBA team Sean and Jen McCown put the kids to bed and stay up late making videos.  [OK, they’re SQL Server videos – stop snickering.]  There’s a wide variety of content here, and it’s well worth a visit.

Twitter – OK, this one is reaching just a bit.  I add this because there is a fair number of SQL Server pros on Twitter, many of whom are more than willing to offer advice, share upcoming events (webcasts and such) and “tweet” interesting information.  SQLServerPedia has a good index page listing many of the SQL folks who use Twitter.

Upcoming Speaking Engagements for Q4

In the next few months, I’ll be giving a couple of talks on SQL Server business intelligence.  For October, I have the pleasure of presenting for SQL Lunch, a new online learning series pioneered by Patrick LeBlanc.  On October 12th, I’ll be discussing ways to leverage SQL Server Report Builder 2.0 against your existing SSRS infrastructure to allow users to create their own ad-hoc and published reports.

Also, I’ll be visiting the Ft. Worth SQL Server User Group on November 18th.  I’ll be discussing Intermediate SSIS, demonstrating some of the more complex tasks and transformations and showing how to leverage these tools in challenging ETL scenarios.  Hope to see you there!

Skipping Items in a Foreach Loop

Recently, my friend Jack Corbett asked a question on Twitter:

Any way to ‘continue’ a For each file enumerator in SSIS if the file name == a variable?

In a nutshell, the SSIS foreach loop will enumerate a given list of items (files in a directory, nodes in an XML file, static list of values, etc.) and will perform some operation for each of the items in the collection.  This behavior is similar to foreach loop constructs that are found in most high-level programming languages.  These programming elements normally include a continue statement, which allows you to stop processing of the current item and move on to the next one.  Jack hoped to find this capability built into the SSIS foreach loop, but unfortunately it doesn’t exist natively.
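Incidentally, T-SQL itself offers the same construct in its WHILE loops. This quick sketch skips a single iteration, printing 1, 2, 4, and 5:

DECLARE @i INT = 0
WHILE @i < 5
BEGIN
    SET @i = @i + 1
    IF @i = 3 CONTINUE -- skip the rest of this iteration and start the next
    PRINT @i
END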

However, there is a workaround that will allow you to reproduce the continue statement in the SSIS foreach loop.  By including a “dummy” data flow within the foreach loop, we can use a precedence constraint to skip some of the items in the list based on the value of a variable.  To demonstrate, I’ll use a foreach loop to make a copy of all of the files in a given directory except for one specific file.  I’ll first set up the foreach loop in the control flow pane to enumerate over the files in a particular directory, by using the Foreach File Enumerator setting:

[Figure: Foreach Loop Editor configured with the Foreach File Enumerator]

Now, within the foreach loop above, I create two data flows: one that does nothing at all, and another that reads the current file in the list and writes the data out to a new file, the name of which is set using an expression.

Next comes the important part: I create a precedence constraint from my dummy data flow to the one that actually performs the work, and edit its expression to exclude one of the file names:

[Figure: precedence constraint connecting the dummy data flow to the working data flow]

[Figure: Precedence Constraint Editor showing the exclusion expression]
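Assuming the enumerator maps the current file name to a variable I’ve named CurrentFile, and the file to be skipped is C:\Data\skipme.txt (both names are mine, for illustration), the constraint expression would read:

@[User::CurrentFile] != "C:\\Data\\skipme.txt"

Note the doubled backslashes; the SSIS expression language treats the backslash as an escape character inside string literals.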

In this scenario, the package flow will only move from the dummy data flow to the one containing our business logic if the variable value, in this case the current filename, does not match the one we intend to skip.

So the short answer is that the SSIS foreach loop doesn’t have native capability to skip processing for certain values, but using a do-nothing container and a precedence constraint within the loop is a creative and effective substitute.

[Cross-posted from SQL Server Central]

Continue in Foreach Loop.dtsx (46.79 kb)

North Texas SQL Server User Group Meeting on Thursday

The monthly North Texas SQL Server User Group (NTSSUG) meeting will be held this Thursday, September 17, at 7:00pm at the Microsoft campus in Irving, Texas.  Our guest will be SQL Server MVP and consultant Geoff Hiten, who will be presenting “Bad SQL – Why Does This ‘Perfectly Good’ T-SQL Run So Slow?”

Admission is always free, and pizza and drinks will be provided.  Bring a friend!

For more information on this event or the NTSSUG group, visit our website at http://northtexas.sqlpass.org/.

Getting Into the Biz, Part 3

In a couple of previous posts (Part One and Part Two), I shared some thoughts about starting a career path as a database professional, including some notes from my own journey.  In this final segment, I’ll share a few discrete tips that will hopefully be useful to those interested in this field.

Build Your Foundation: You’re going to need some base skills outside the immediate scope of SQL Server.  Regardless of what role you pursue (DBA, database developer, or BI engineer), you need to have at least a decent understanding of the technologies that support SQL Server.  Get to know how Active Directory works, particularly with respect to users, groups, and authentication.  Learn about Windows server architecture, including clustering technology.  Spend some time learning about disk configuration, RAID levels, and SAN architecture.

Get Training: Be proactive in training yourself. If your employer won’t provide training, don’t use that as an excuse – remember that you are the one with the most skin in the game in your career development.  Take advantage of the training resources available to you, many of which are free (here’s a good list to start with).  Get a developer copy (nominal cost) or an evaluation version (free) of SQL Server, and start practicing.  Get a couple of books and spend the time reading them – in fact, most beginner SQL Server books will have some examples you can work through to get some good hands-on work with the product.

Get Experience: Spend some time getting your hands dirty with real projects.  Even though you won’t find a DBA job without having previous experience, there are ways to build tangible, legitimate, documentable experience even if you’re not working full-time as a database professional:

  • Get experience at your current job.  Do you have a DBA at your place of work?  Spend some time with that person or team, and find some ways you can assist them, even if you’re performing menial tasks during your lunch break.  If your place of work doesn’t have a DBA, are there currently any SQL Server instances there?  Volunteer to run the backups, or to do a test restore to test your disaster recovery procedures.  The key here is getting your foot in the door; prove you can do a good job with small things and you’ll eventually receive more responsibilities.
  • Volunteer.  There are a lot of organizations that would welcome volunteer labor to assist with their technical needs.  Talk to the folks at your kids’ school, local nonprofits, places of worship and the like, and find out if there are any SQL Server-related projects needing some attention.  You may even find structured volunteer opportunities, such as GiveCamp (FWIW, I’ve participated in GiveCamp for a couple of years, and it’s a great opportunity).  Even though you’re working for free, you’re building experience on legitimate projects which will improve your skills and provide documentable evidence of your work.

Get Involved in the Community: Much has been written about the benefits of being involved in the SQL Server community, and I won’t try to reproduce it all here.  Suffice it to say that active involvement with other similarly-minded professionals provides an avenue for learning and advice, and can expose career opportunities not otherwise available to you. Join PASS and take part in their free webcasts.  Find a local SQL Server user group and attend regularly.  Join Twitter and follow some SQL Server tweeps. The community is open to newcomers, and most venues welcome even the most basic questions for those trying in earnest to learn about SQL Server.

Pay Your Dues: Know up front that you’re not going to walk straight into a DBA job.  With hard work and perseverance, you can eventually work your way into a role where you’re handling difficult problems and working on cutting-edge projects.  To get there, however, you’re going to have to start small, running database backups, changing tapes, and reviewing log files.  Note that paying your dues may include some seemingly unrelated tasks (see Build Your Foundation above).

Above all else, keep your chin up – don’t get discouraged by the “Can’t get a job without experience, and can’t get experience without a job” cycle.  Put yourself in an employer’s frame of mind: would you want to hand over the keys to the kingdom to an untested, inexperienced rookie?  Know that it won’t happen overnight, and that you’ll most likely take a series of small steps rather than a huge leap into this field.  Prove yourself worthy with the small stuff, and the big stuff will come in time.

Good luck to you!

[Cross-posted from SQL Server Central]

SQL Saturday Dallas Update

Just a quick update on the SQL Saturday event we are planning for the Dallas area next year.  Late last week our local event planning committee met to discuss the big stuff (venue, date, and advertising) and came away with a short but critical action list. 

We’ve got a venue that has given soft confirmation to accommodate us, and we’re working to lock that in this week.  For the event date, we had to dodge a number of other events in our January/February timeframe; the remaining Saturdays were January 23 and January 30, and we expect to have a decision between the two in the next week or so.  We’ve got some folks who are contacting potential advertisers, both international and local, to help fund what we expect to be an event with 500 or more professionals in the database, development, and business fields.  [Sidebar: If you are a business principal interested in sponsoring this event, let me know – tdmitch at gmail.]

There are still significant decisions to be made and lots of work to do.  Once the date is set, we’ll publish the event to the SQL Saturday website and open it up to registrations and speaker abstracts. Later, the abstracts will need to be reviewed and the schedule created from the submitted sessions.  I’ll also be contacting several high-profile speakers to see if they can work this event into their schedule.  We need to arrange catering for lunch, along with donuts, snacks, and drinks for the remainder of the day.  We will be seeking a host hotel for out-of-town attendees, and the speaker reception and post-event party will need to be set up.  That’s just the big stuff – there are flyers to print, conference bags to stuff, tables to set up and tear down, and hundreds of other little things that (hopefully) will be planned well in advance.

Fortunately, we’ve got a good group that has expressed a willingness to get involved.  Our combined area user groups (Dallas and Ft. Worth) represent hundreds of technical professionals, and we’ve already had a number of volunteers come forward and offer to help organize the event or present a session.  Several of our group, myself included, have been to SQL Saturday events before and have been witness to what works and what could use improvement.

In the meantime, we’re still gathering a list of volunteers.  If you’re interested in helping plan or execute this event, please let me know and I’ll add you to our volunteer contact list.  We’re also building a list of those interested in presenting, and we’ve already heard from several of you who would like to speak.

I’ll keep the updates coming as things develop.

Updating Existing Data with SSIS

I see a lot of questions on the forums about updating existing data in SSIS.  When the update is dynamic and based on elements within your data flow, a common mistake is to use the OLE DB Command within the data flow pane:

[Figure: data flow using the OLE DB Command to perform the update]

 

The above poorly designed update has us retrieving data from some source and then running an update on our target database based on one or more values in the source.  Does it work?  Sure.  The problem occurs when you touch more than a handful of rows.  When you bring the OLE DB Command into a data flow, the statement in this component fires once for every row of data in your pipeline.  So let’s say our source retrieves 100,000 rows into the data flow: the downstream UPDATE command will be executed 100,000 times!  Such operations could bring the most capable server to its knees.
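To make the problem concrete, the statement configured in the OLE DB Command typically looks something like this (table and column names borrowed from the earlier example; the ? placeholders are mapped to pipeline columns):

UPDATE Invoices
SET    InvoiceAmount = ? -- parameter mapped from a pipeline column
WHERE  InvoiceID = ?     -- executed once per row in the pipeline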

A better solution could be to stage your data.  Using this method, you’ll retrieve the data from the source and write it into a staging table in the destination database.  You can then use an Execute SQL Task to run your update in a set-based manner.
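For illustration, the staging table referenced by the UPDATE below might be as simple as this (a sketch; in practice it would mirror whatever columns you need from the source):

CREATE TABLE StagedData (
    InvoiceID INT,
    UpdatedInvoiceAmount DECIMAL(10, 2))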

[Figure: data flow loading the staging table]

Set up your data flow as shown above to pull the data into the staging table; then you can run a single UPDATE statement:

UPDATE i
SET InvoiceAmount = st.UpdatedInvoiceAmount
FROM Invoices i
INNER JOIN StagedData st ON i.InvoiceID = st.InvoiceID

The advantage here is that you’re executing the expensive UPDATE statement once per load rather than once for each affected row.

Of course, some situations explicitly disallow the use of staging tables in destination systems.  If storage or access restrictions keep you from using this method, you may have to fall back on the row-by-row update, so be aware that it’s going to be a bottleneck.

[Cross-posted from SQL Server Central]