Azure Data Factory has grown in both popularity and utility over the past several years. It has evolved beyond the significant limitations of its initial version and is quickly maturing into a capable enterprise ETL tool.
In my last post on this topic, I shared my comparison between SQL Server Integration Services and ADF. In this post, I’ll walk through the essential tasks of creating your first Azure data factory.
Creating Your First Azure Data Factory
To create a new data factory, you’ll first log into your Azure portal. If you haven’t yet set up an Azure account, you can sign up here to get started.
Once you’re logged into the Azure portal, the default view will show those resources that were recently created or used. You’ll also see a list of shortcuts to some of your commonly used services.
Click on the link to Data Factories. If you don’t immediately see it in your shortcuts, you can use the Azure search bar to find it. On the Data factories window, you’ll see the list of data factories you’ve created (if any). From here, click the Add button to begin creating your first Azure data factory.
Create a New Data Factory
As shown below, the Create Data Factory screen is fairly simple. Select the subscription, resource group, and region, and give it a name – that’s it. You can also specify the data factory version, but the only real option here is V2 (trust me when I say that ADF V1 was very V1). You’ll also notice the option to Enable Managed Virtual Network, which I’ll cover in a later post.
After creating the new data factory, the next step is to open the editor. Shown below is the new, empty data factory, ready to be configured.
For this example, I’m going to build a simple import that will perform the following tasks:
- Connect to a file named EmployeeStatus.txt in an Azure blob storage account, and load that data to a table of the same name in an Azure SQL database
- Connect to a file named Employees.txt in an Azure blob storage account, and load that data to a table of the same name in an Azure SQL database
I’ll use a single pipeline named Dunder Mifflin Employees, and will include two activities in that pipeline, one for each of the files I will import into the database.
For each of these operations, I’m using the copy data activity. This is one of the simplest activities available in ADF, and allows us to grab all of the data from a source and load it all to a destination (commonly referred to as a sink in ADF). To configure the settings, click on the activity to reveal the properties window as shown below.
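Behind the authoring UI, every ADF artifact is stored as JSON, which you can inspect through the editor’s code view. As a rough sketch (the dataset reference names below are placeholders I’ve made up; yours will match whatever you name your datasets), the first copy data activity looks something like this:

```json
{
  "name": "Copy Employee Status",
  "type": "Copy",
  "inputs": [ { "referenceName": "EmployeeStatusSource", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "EmployeeStatusSink", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": { "type": "DelimitedTextSource" },
    "sink": { "type": "AzureSqlSink" }
  }
}
```

You rarely need to hand-edit this JSON, but seeing it helps demystify what the designer is generating on your behalf.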
Setting up Datasets and Linked Services
Most of the structured data movement activities require the creation of an underlying dataset. A dataset in ADF defines the connection and metadata for a set of structured data used as either a source or a sink. Each of the copy data activities I’m using requires two datasets – one for the source, another for the sink. You’ll notice that there’s an attention flag (the red number 1) beside Source and Sink in the properties window, indicating that the datasets have not yet been defined for either of these.
Before creating the source dataset, I’ll have to first create a connection (called a linked service in ADF) to the storage account, named DunderMifflinDataFiles. Both of the files are located in the same folder in that storage account, so I can just create a single linked service and then specify for each copy activity the file I want to use. Shown below is the configuration blade for that Azure Blob Storage linked service.
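For reference, an Azure Blob Storage linked service of this kind serializes to JSON roughly like the sketch below (the connection string is a placeholder; in a real deployment you’d typically store the secret in Azure Key Vault rather than inline):

```json
{
  "name": "DunderMifflinDataFiles",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<storage account>;AccountKey=<key>"
    }
  }
}
```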
After setting up the linked service to the Azure Blob Storage account, I’ll create a dataset for the first of the two files to process. For the Employee Status load, I’ve set the properties of this new dataset to use the EmployeeStatus.txt file from that storage account. For this example, the only other change I had to make was to change the delimiter from a comma to a pipe to match the format of my source data file.
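The resulting delimited text dataset looks approximately like this in JSON (the dataset name and container are assumptions for illustration; note the pipe in columnDelimiter, and I’m assuming the file has a header row):

```json
{
  "name": "EmployeeStatusSource",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "DunderMifflinDataFiles",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "<container name>",
        "fileName": "EmployeeStatus.txt"
      },
      "columnDelimiter": "|",
      "firstRowAsHeader": true
    }
  }
}
```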
To complete the setup of the source for this copy data activity, I’ll set the properties of the source to use that new dataset.
I’ll repeat the last two steps – creating the dataset and using that dataset as a source – for the second copy data activity for the Employees.txt file.
With the source linked service and datasets configured, the next step is to configure the sinks for each of those activities. I’ll follow a similar process by first creating a linked service to my Azure SQL database, followed by the configuration of the sink settings for each activity. Shown below, I’ve created the new linked service for the Azure SQL database to which the data will be loaded.
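An Azure SQL Database linked service follows the same pattern; a minimal sketch is shown below (the linked service name and connection string values here are placeholders, and again, a Key Vault reference is the better home for the password):

```json
{
  "name": "DunderMifflinSqlDatabase",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=<database>;User ID=<user>;Password=<password>"
    }
  }
}
```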
The configuration of the dataset for a database table is usually very simple – just pick the table from the list.
Next, set the copy data activity to use this new dataset as its sink. In this example, the data in the flat files matches column-for-column with the schema in the table, so there’s no need to modify the data mappings. However, if needed, the Mapping tab can be used to align columns in the source with their counterparts in the sink.
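When you do define explicit mappings, they’re stored as a translator fragment inside the copy activity’s typeProperties. A sketch of what that fragment looks like, using hypothetical column names for illustration:

```json
{
  "translator": {
    "type": "TabularTranslator",
    "mappings": [
      { "source": { "name": "EmployeeStatusID" }, "sink": { "name": "EmployeeStatusID" } },
      { "source": { "name": "StatusName" }, "sink": { "name": "StatusName" } }
    ]
  }
}
```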
Repeat the above for the sink for the other copy data activity to complete the configuration.
Add Pipeline Constraint
Once this is done, I’ll set up the constraints between these two activities. Because the EmployeeStatus table is referenced by a foreign key on the Employees table, this pipeline should constrain these operations to load EmployeeStatus followed by Employees. I’ll use an activity dependency to do this. As shown below, the arrow drawn between the two activities indicates that the one on the source side of the arrow (Copy Employee Status) must complete successfully before the one on the target side (Copy Employees) begins.
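That green arrow serializes as a dependsOn property on the downstream activity, roughly as sketched below. The Succeeded condition is what the default success constraint uses; other conditions (Failed, Skipped, Completed) are available for error-handling patterns.

```json
{
  "name": "Copy Employees",
  "type": "Copy",
  "dependsOn": [
    {
      "activity": "Copy Employee Status",
      "dependencyConditions": [ "Succeeded" ]
    }
  ]
}
```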
Test the Data Factory
It’s time to test! I’ll use Debug to test this functionality to make sure there are no errors, and that the output is what I expect. Shown below, I can run a test execution directly from the editing window.
Publish the Data Factory
The only thing left to do is to publish the finalized changes. Clicking the Publish All button moves the underlying changes from their temporary state into the live data factory, where they can be invoked manually or on a schedule using a trigger.
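As a preview of scheduled execution, a schedule trigger for this pipeline would look roughly like the sketch below (the trigger name, start time, and daily cadence are all assumptions for illustration):

```json
{
  "name": "DailyEmployeeLoad",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2021-01-01T06:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "Dunder Mifflin Employees",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```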
In this post, I’ve briefly shared how you can begin creating your first Azure data factory. In my next posts on this topic, I’ll dig more into complex examples, and will demonstrate how to apply ETL best practices within Azure data factory architectures.
Thanks Tim! I had to configure GitHub to create my data factory, and had to create the storage, database, and files too. All in all, a great framework you made here. -Rob
Hi Tim, thanks for posting a great article (I got this from Brent O.’s weekly links). I have always wondered how this all fitted together when, say, you want to export a .csv file from Azure Blob storage to an Azure SQL Database. My question is: if there is no corresponding table in the destination database, how do you get ADF to create it from scratch? Is there a setting for this? Cheers, Scott
Hi Scott, I’ll be blogging soon about some of those design patterns, including importing/exporting CSVs and managing the underlying data in the target database.