Through the course of my eight-odd years of building and fixing ETL processes, I’ve had the opportunity to see a lot of ETL code. Some of that code was really good: well thought out and carefully executed. Other load processes were – well, let’s just say that they provide plenty of consulting opportunities (and I include much of my early code in the latter category). I’ve discovered patterns that tend to lead to success in an ETL developer job, and others that are just one failure away from a career catastrophe. It’s the latter category I’ll discuss in this post. Specifically, I’ll share with you…
How to Get Fired from an ETL Developer Job
Don’t get me wrong. I don’t wish for any of you to get fired (well, I don’t wish that for all of you). Rather, I want to share some antipatterns – most of which I’ve been guilty of using during my apprenticeship days – to help avoid those résumé-generating events.
Don’t use execution logging tools
I remember the first production ETL process I ever built. It was an SSIS package built on SQL Server 2005, and that thing was a behemoth – 5 megabytes of XML metadata alone. But it was “the package” – as in, the only one we had at the time. Because I was running it manually (a requirement of our business process at the time), I knew every inch of that package – normal runtimes, probable failure points, and expected warning messages. As such, I didn’t set up much in the way of logging. However, I quickly learned that this wasn’t a sustainable design. As that one package eventually begat dozens more, it was clear that my manual observation model wouldn’t scale. I had to build logging into all of the packages so that I (and those who were to come after me) would have a reliable and repeatable method for capturing metrics for package executions. If we had had a catastrophic failure of “the package” and my powers of observation had not been enough to troubleshoot the problem, I would have found myself in a real pickle.
After many years in the business since “the package”, I’ve found that developing a proper logging strategy isn’t something that only I struggled with. It’s a common issue even now, when simple and easy-to-use logging is built into almost every ETL product.
To be frank, logging isn’t a sexy topic (in ETL or otherwise). But it’s a cornerstone of any mature ETL process. Logging provides insights such as typical runtimes and row counts, which can help you identify patterns of normalcy and plan for growth. Unless you want to find yourself talking to HR about the diminished future of your ETL developer job, don’t skimp on this part.
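The core idea is simple enough to sketch in a few lines. This is a minimal, illustrative Python sketch – not SSIS-specific, and the table and column names are my own invention – of an execution log that captures a row for every run, whether it succeeds or fails:

```python
import sqlite3
import time

# Minimal execution-log sketch: one row per package run, recording start/end
# times, rows loaded, and status. (Illustrative names; in SSIS you'd typically
# use built-in logging or the SSIS catalog instead.)
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE etl_execution_log (
        package_name TEXT,
        start_time   REAL,
        end_time     REAL,
        rows_loaded  INTEGER,
        status       TEXT
    )
""")

def run_with_logging(package_name, load_fn):
    """Wrap an ETL step so every run leaves a log row, success or failure."""
    start = time.time()
    try:
        rows = load_fn()
        conn.execute(
            "INSERT INTO etl_execution_log VALUES (?, ?, ?, ?, ?)",
            (package_name, start, time.time(), rows, "Succeeded"),
        )
        return rows
    except Exception:
        conn.execute(
            "INSERT INTO etl_execution_log VALUES (?, ?, ?, ?, ?)",
            (package_name, start, time.time(), None, "Failed"),
        )
        raise

run_with_logging("LoadCustomers", lambda: 42)  # a stand-in load step
```

Once every package writes through a wrapper like this, questions such as “how long does this normally take?” become a query instead of a guess.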
Don’t set up exception notifications
“If a tree falls in a forest and no one is around to hear it, does it make a sound?” One could ask the same question about ETL processes. If your general ledger load fails overnight and the support team doesn’t know about it, you may not know anything is wrong until the financial folks arrive at 6:30am (Aside: why do financial people always go to work so early?) and they’re at your door demanding answers. A process failure is never a good thing, but being blindsided by said failure can turn an unhappy day into a career-altering one.
Every enterprise ETL scheduling tool has some mechanism for notifying personnel of failures and other exceptions, most of which support a variety of notification types. There’s simply no excuse for not setting up proper notifications.
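In the tools themselves this is usually a checkbox (an operator on a SQL Agent job, for example), but the pattern is worth spelling out. Here’s a hedged Python sketch – the job name and the notification hook are placeholders for whatever channel you actually use – showing a failure handler that alerts before re-raising, so the scheduler still sees the failure:

```python
def notify_on_failure(job_name, load_fn, notify):
    """Run a load step; on any exception, fire the notify hook, then re-raise.
    The notify callable stands in for whatever channel you use
    (Database Mail, a pager, a chat webhook)."""
    try:
        return load_fn()
    except Exception as exc:
        notify(f"ETL job '{job_name}' failed: {exc}")
        raise

# Usage: collect notifications in a list to simulate sending them.
alerts = []

def failing_load():
    raise RuntimeError("source table not found")

try:
    notify_on_failure("GeneralLedgerLoad", failing_load, alerts.append)
except RuntimeError:
    pass  # the failure still propagates to the scheduler
```

The key design point is that notification is a side effect of failure, not a replacement for it: the exception keeps propagating so the job is still marked failed.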
Don’t audit your ETL loads
“The load ran without errors, so it was successful.” I always cringe when I hear these words, because this is akin to stating, “The code compiled, so the application works.” The absence of errors or explicit warnings does not necessarily imply that an ETL process has worked as intended; it simply means that the process has not encountered any known technical or business logic errors. There could still be any number of conditions – rows lost or mishandled during the ETL process, an out-of-balance financial load, and other implicit errors – that do not force the ETL mechanism into an error state. The big risk here is that you may not know there is a problem until hours or perhaps days afterward, after the users have already been making business decisions on that bad data. That’s going to be a very unpleasant conversation.
Building auditing routines into ETL processes isn’t a trivial task, and it may take some selling to management to get approval to invest the time in it because it doesn’t directly add functionality. However, for any critical data load, auditing metrics such as financial totals and row counts is essential for long-term and trustworthy load processes.
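At its simplest, an audit compares what the source offered against what the destination received. This Python sketch is illustrative only – real audits would run against the databases themselves – but it shows the shape of a row-count and financial-total check that catches a “successful” load that silently dropped a row:

```python
def audit_load(source_rows, dest_rows, amount_field="amount"):
    """Compare row counts and financial totals between source and destination.
    Returns a list of discrepancies; an empty list means the load balances."""
    problems = []
    if len(source_rows) != len(dest_rows):
        problems.append(
            f"Row count mismatch: {len(source_rows)} source vs "
            f"{len(dest_rows)} destination"
        )
    source_total = sum(row[amount_field] for row in source_rows)
    dest_total = sum(row[amount_field] for row in dest_rows)
    if source_total != dest_total:
        problems.append(f"Amount mismatch: {source_total} vs {dest_total}")
    return problems

# A load that ran "without errors" but silently dropped a row:
source = [{"amount": 100.0}, {"amount": 250.0}, {"amount": 50.0}]
loaded = [{"amount": 100.0}, {"amount": 250.0}]
issues = audit_load(source, loaded)
```

Note that neither discrepancy here would ever raise an error in the load itself; only an explicit audit step surfaces them.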
Don’t use source control
Make no mistake: ETL processes are still code, and should be managed as such. I see this a lot, especially with graphics-driven ETL tools such as SSIS. It’s easy to forget that, even though there’s a pretty GUI interface, the logic is still just plain ol’ code, and it needs to be managed using some source control mechanism. The failure to use source control for ETL code often hints at deeper issues such as inconsistent release management strategies and sketchy backup histories. These things are catastrophes waiting to happen.
This is likely the most common issue from this list that I encounter in the real world – either clients don’t consider ETL logic to be source code and therefore they don’t use source control, or even more frighteningly, they don’t even have a source control system. I’ve run into this problem so often that I routinely offer to my clients to set up an online source code repository for them, just to ensure that they at least have the means to easily archive and version their ETL code base.
Don’t document your packages
Think back to when this has happened (and yes, it’s happened to us all). Someone asks about the behavior of a particular piece of code, whether it’s an SQL script, an SSIS package, or a C# application. You go back to the code, staring at it to try to understand the original intent. You ask yourself, “Who wrote this crap, and what were they thinking?” And then it hits you: “I wrote that code.” This is one of the lies we tell ourselves – that we’ll remember why we built things the way we did, even without documentation. In reality, we end up building so many things that it is nearly impossible to remember exactly what we built two weeks ago, much less years down the road.
It’s no secret that documentation is among the least favorite development tasks to engage in. There’s little excitement. You’re not really building anything. And it must be constantly updated. What a drag! However, as mundane as it is, good documentation can make or break an ETL project. Proper documentation helps analysts understand how business logic is being applied in the ETL. It can provide auditors with the evidence they need to visualize data lineage patterns. And most importantly, it keeps you, the ETL developer, from having to decipher your own code.
If you’re developing new processes, be sure to properly document them. If you take on a new ETL developer job and you find an absence of existing documentation, you’ll make your job easier and learn a lot in a short amount of time by building documentation for those processes.
Hard-code connection strings in your packages
Imagine this: You’re building a greenfield ETL system for a new data warehouse, and you’re trying to crank out code as quickly as possible to meet a deadline. To save time, you just type in (or copy+paste) the connection strings for your source and destination connections rather than building a proper configuration table. Since the whole project involves a couple hundred packages, you can save hours of time by using this shortcut. However, a year down the road, the OLTP server is replaced with new hardware, which also means a new server connection string. You now have the unenviable task of telling your boss that you have to modify, redeploy, and retest every single package for the data warehouse load, when this could have been done with a simple configuration table or file.
Hard-coding values that should be externalized is among the group of sins classified as technical debt. Early on it’s very easy to take these shortcuts, but eventually you’ve got to pay the price for delaying good design.
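The fix is to resolve connections by logical name from one external source. Here’s an illustrative Python sketch – file format and names are my own; in SSIS you would reach for a configuration table, package configurations, or project parameters instead – showing why a server swap then becomes a one-line change:

```python
import json
import tempfile

# Sketch: connection strings live in one external config file, and every
# package looks them up by logical name. Swapping the OLTP server then means
# editing one config entry instead of hundreds of packages.
def get_connection_string(config_path, logical_name):
    """Resolve a connection string by its logical name from a JSON config."""
    with open(config_path) as f:
        return json.load(f)[logical_name]

# Usage: write a throwaway config file and resolve a connection by name.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as cfg:
    json.dump({"OLTP_Source": "Server=NEWSRV01;Database=Sales;"}, cfg)
    config_path = cfg.name

conn_str = get_connection_string(config_path, "OLTP_Source")
```

The packages never see a server name directly; they only ever ask for “OLTP_Source”, and the config answers.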
If you decide not to follow this instructional guide on how to get fired from an ETL developer job, is there any guarantee that you won’t still find yourself in hot water? Of course not. But if you’re an ETL professional – or if you’re a DBA or developer who’s been tasked with building ETL processes in addition to your regular duties – steering clear of these antipatterns will help you avoid some of the more common traps I’ve observed.
Good post – Any advice on how to document packages? Annotation or anything a bit more elaborate?
Paul, I use a couple of different methods. For quick tips in the SSIS packages themselves, I use annotations. I also externally document them using Word, Visio, or both for more formal documentation.
Great points, Tim. Fortunately I am using almost all of them except documentation. I am using SSIS as an ETL tool and put annotations in the control flow. But my data flow is complex and I am not sure how to document that within my package. Is there a way to do that besides creating an external document to detail out the data flow?
Prateek, that’s a great question. Very often I have an external mapping document that defines complex data flow mappings. But I have seen documentation methods that include taking screenshots of the data flow mappings and pasting that into a Word document where the mappings can be clarified. I’ve also heard good things about a tool called DocXpress, but I’ve not used it myself.
Tim, great post. Do you have any best practices for establishing connection strings in SSIS 2012? Specifically, best practices surrounding connection strings during the promotion of SSIS packages between a DEV/TST/PRD environment.
Josh, thanks for the comment. For 2012+ projects using project deployment, I recommend using SSIS Environments to manage connection strings and other critical values. These can streamline the process of moving from one realm (Dev, QA, Prod) to another.
An excellent read. Nicely done Tim!
Thanks Jim! Glad you enjoyed it.