For the better part of 15 years, SQL Server Integration Services (SSIS) has been the go-to enterprise extract-transform-load (ETL) tool for shops running on Microsoft SQL Server. More recently, Microsoft added Azure Data Factory (ADF) to its stable of enterprise ETL tools.
In this post, I’ll be comparing SSIS and Azure Data Factory to share how they are alike and how they differ. I’ll also review the strengths and shortcomings of each, including the architectures in which each of these is likely to do well.
Overview of SSIS
Regular readers of this blog are almost certainly aware of what SSIS is, but for those who are new, here’s a quick recap. Integration Services is ETL software that comes with commercial versions of SQL Server. SSIS is installed on the same server as a SQL Server instance, and encapsulates its functional units of ETL work into packages which can be executed individually or in groups.
SSIS is installed software, meaning that it does require an actual machine – either physical or virtual – on which to run (with one notable exception, to be discussed shortly). Because it is installed software, it is quite customizable, either through the scripting tools built into it (the script task and script component) as well as a variety of third-party vendor offerings that work seamlessly in SSIS.
Once installed, there are no ongoing costs to run SSIS packages. Even better, you can create a development environment using SQL Server Developer Edition, which costs you nothing in licensing fees as long as you’re using that server exclusively for non-production activities. SSIS has been around for more than 15 years – much longer, if you consider that it was based on Data Transformation Services that preceded it – so it is a very mature and stable product. Although it was clearly developed with an on-premises work load in mind, it has been adapted over the years to support both local and cloud-based data connections.
Overview of ADF
Azure Data Factory is a serverless ETL tool based on the popular Microsoft Azure platform. Of the two tools, this one is much newer, having been released around 2014 and significantly rewritten in its second version (ADF v2) around 2018. Note that most everything you’ll find on ADF is now in v2, as the original version was quite spartan and not at all user friendly.
ADF was built as a cloud integration engine, and its data connections natively support most types of cloud endpoints. With few exceptions (such as the self-hosted runtime, which I’ll cover in a future blog post), the entire works runs in the cloud, meaning there’s no software to install and no operating system to configure. In ADF, a data factory contains a collection of pipelines, the analog to the project and package in SSIS, respectively. A pipeline can have multiple activities, mapping data flows, and other ETL functions, and can be invoked manually or scheduled via triggers.
Because it is a service rather than software, its cost is based on usage. The flexible licensing model means that you can pay as you go rather than buying a license up front. Scaling up or down is fast and easy in ADF, and the cost reflects the level of performance you have configured.
Comparing SSIS and Azure Data Factory
Before comparing these two products, let’s get a few fundamentals out of the way first:
- This author has an experiential bias toward SSIS, since that tool has worked well for me throughout my entire consulting career. (Not that this article is purposefully leaning either way, but I like to point out any potential that my bias could leak into an otherwise objective analysis.)
- ADF is not just “SSIS in the cloud”. While these are both ETL tools with some amount of shared functionality, they are completely separate products, each with its own strengths and weaknesses.
- ADF is not a replacement for SSIS. Microsoft is clearly continuing to support SSIS, and with its ubiquitous use in enterprises worldwide, it’s not likely to be deprecated any time soon.
There is one exception to the “ADF isn’t cloud SSIS” statement. Azure Data Factory, in addition to its native data factory functionality, allows for the creation of an SSIS runtime to store and execute SSIS packages in much the same way one would do in an on-prem instance. This ADF SSIS integration runtime (IR) allows organizations that are slowly migrating to the cloud or need to retain a part of their existing SSIS infrastructure to move to ADF while keeping those SSIS assets intact. This IR looks very much like a conventional SSIS catalog, with a familiar user interface experience and all the same monitoring tools. It also runs entirely as a serverless service, eliminating the need to install SQL Server or maintain a server machine just for running SSIS packages.
Both of these are enterprise-ready ETL tools, so you can’t go wrong with using either of them for mission critical and high performance needs. Below I’ll be comparing SSIS and Azure Data Factory based on learning curve, functionality and features, ease of administration, performance, and cost.
Getting started with any new software or service takes time and practice prior to mastery. When considering the learning curve of any tool, one must consider the following:
- If someone were sitting down in front of this tool for the first time, could they understand the basics?
- With minimal guidance, how quickly could the average data professional be productive at this tool?
- How easy is it to practice in a safe sandbox environment?
- How well documented is this tool?
- How many skilled experts are available on this tool (in case you need to hire another team member or contractor)?
With both SSIS and ADF, it’s reasonably easy to perform simple tasks. If one has worked with any ETL tool before and understands the concepts of data connections, sources, destinations, and transformations, building a simple demo load process should require no more than a couple of hours and a few queries of their favorite search engine.
As far as the time it takes to become productive, I’m going to give a slight advantage to SSIS here. Because SSIS has been around for so long and hasn’t had many major changes since 2012, its maturity and slower evolution make it less of a moving target than its younger cousin.
Both of these allow practicing in a safe environment. In each, it’s up to the administrator setting up the environments to ensure that it the sandbox is isolated from live production data.
Both SSIS and ADF have documentation both from Microsoft and the community at large, but SSIS is the better documented product. Again owing to its age and glacial pace of changes, it’s easier for Microsoft, bloggers, and book authors to write about SSIS, and the body of work around it has a nearly 10 year head start on ADF.
There are also more data professionals who are highly skilled at SSIS than ADF. However, this is rapidly leveling out as ADF leaps forward in popularity. It’s not exceptionally difficult to find help on either SSIS or ADF, but the former has a (likely temporary) advantage here.
Functionality and Features
Both SSIS and ADF have essential functionality that one would expect in an enterprise ETL tool, including:
- Connectors to a variety of data structures, both relational and nonrelational
- Workflow / order of operations
- Data flow transformations to allow cleansing or reshaping of data in flight
- Looping logic (for and while operations)
- Error handling at the task level
- Parameterization for dynamic behaviors
Although both tools have the bases covered on essential functionality, I’ll give SSIS the nod here in terms of the number of operations that can be done out of the box. Integration Services has a larger library of built-in control flow and data flow functionality, and it has granular row-level error handling capabilities that are not found in ADF. The native components in SSIS (including the free Microsoft Azure feature pack) and the numerous free and commercial add-ins make it so that you’d hardly ever need to write code in SSIS. I expect we’ll see more feature parity as ADF continues to evolve. But for now, even if yours is an ADF shop, you can create an SSIS package and deploy it to the ADF SSIS integration runtime to allow for that SSIS-specific functionality.
Ease of Administration
Azure Data Factory and SQL Server Integration Services are each equipped with functionality to make it easy to manage and monitor, including:
- Execution logging
- Source control integration
Both products do a good job at each of these necessary operations.
Performance is difficult to compare, because there are so many different variables involved. For my presentation on this topic at the PASS Summit last November, I did a simple performance comparison on SSIS versus ADF. In doing so, I created a set of load operations on each tool that loaded a small file (21kb) and then a larger file (300mb), and for each I tried it by doing a simple copy operation as well as a full ETL with transformative logic. For this test, I configured a very minimal set of resources on both SSIS and ADF. In doing this test, here’s what I found:
- With the small file with no transformations (simple copy), performance on SSIS and ADF was almost identical
- With the small file and some lightweight transformations, SSIS performed much, much faster than ADF (8 seconds versus 180 seconds)
- With the large file with no transformations (simple copy), ADF was much faster (75 seconds versus 408 seconds)
- With the large file and some lightweight transformations, ADF was faster (300 seconds versus 408 seconds)
Part of the reason that ADF ran slower than SSIS on the small file was that, when using a mapping data flow (commonly just called a data flow) in ADF, it has to spin up a Databricks cluster in the background to parallelize that workload. For that reason, small loads on ADF would perform better if designed as an ELT (extract-load-transform) operation rather than the conventional ETL pattern.
In terms of performance, both SSIS and ADF have controls to help optimize runtimes. However, in most optimized ETL or ELT operations, the answer to performance issues is to throw more resources (memory, CPU, disk I/O) at it. This is where ADF is simpler than SSIS, because you can easily change the performance settings with a few clicks, and can do so just for a single execution if desired. ETL operations aren’t necessarily running around the clock, so having the ability to pay as you go and crank up the performance knob can make for greater flexibility and lower cost.
I’m going to give the edge to Azure Data Factory on performance, based largely on that fast and easy scale up/down flexibility.
Of all of these metrics, cost is the most difficult to compare. Because of the different pricing models of SSIS (fixed licensing) and ADF (pay as you go), there’s no single litmus test that will identify the more cost effective option. Instead, consider the following questions when evaluating SSIS versus ADF:
- Is my organization more comfortable with fixed CapEx (capital expenditure) spending, or usage-based OpEx (operational expenditure) charges?
- Should we plan to start with a small cost until our business takes off, and then pay more when we’re processing more data? Or do we just go ahead and buy the hardware and software we think we’ll need for future growth?
- Do we need constant computing horsepower on demand, or are our ETL needs more cyclical and concentrated at certain times of the day?
When considering cost, keep in mind that you need to account for all the costs that go into supporting a particular architecture. For example, if one is running an on-prem SQL Server instance, it’s very easy to leave out of the equation the time required to install the OS, maintain patching, hosting costs (including the physical space to do so), and other administration not normally considered part of database maintenance.
There’s no clear winner here, because ADF (including the SSIS IR services) has a completely different pricing model than on-prem or VM-based SSIS instances. Knowing how your organization processes data, how it expects to grow, and how they handle their budgeting process is essential in making the right financial decision.
As you might have expected, I’m not going to tell you what’s best for you. Both SSIS and ADF are highly capable enterprise ETL tools, and each can succeed when used properly. If you’re at a fork in the road trying to make a decision on which of these tools to use, here are a few scenarios in which each of these does well.
SSIS works well when several of the following statements are true:
- Your workload is mostly or completely on-premises
- You already own SQL Server 2016 or newer
- Your ETL processes are running consistently throughout the day, not concentrated just on certain times of the day
- Your organization already has an investment in SSIS assets
ADF is a great choice when several of the following describe your setup:
- Your organization is growing, and you want to pay for what you are currently using, not what you plan to use in the future
- Most or all of your workload is in the cloud
- You have spikes of activity when your ETL processes are running (nightly, 2x daily, etc.)
- Your leadership is comfortable with a variable, consumption-based pricing model
Every shop’s needs are different, so you’ll have to consider each of these factors when deciding whether to use Azure Data Factory, SSIS, or a hybrid of the two.
Any good craftsperson will tell you that you should pick the right tool for the job. The same applies when comparing SSIS and ADF to select a Microsoft ETL platform. So remember that SSIS and ADF are not mutually exclusive. Don’t think of these two tools as foes, but as complementary to each other. If you have existing investments in SSIS but are planning on moving some of your data estate to the cloud, you can spin up a data factory to interact with those cloud-based assets. Likewise, if you are mostly an ADF shop but have a need for the flexibility of a C# script component, you can deploy an SSIS package either locally or to an ADF SSIS runtime separate from your data factory ETL. Choose the tool most suitable for the job at hand.