Creating Your First Azure Data Factory

Azure Data Factory has grown in both popularity and utility over the past several years. It has evolved beyond the significant limitations of its initial version and is quickly maturing into a strong, enterprise-capable ETL tool.

In my last post on this topic, I shared my comparison between SQL Server Integration Services and ADF. In this post, I’ll walk through the principal moving parts of creating your first Azure data factory.


To create a new data factory, you’ll first log into your Azure portal. If you haven’t yet set up an Azure account, you can sign up here and get a good deal of free services to help you get started.

In the Azure portal, the default view will show those resources that were recently created or used. You’ll also see a list of shortcuts to some of your commonly used services.

Azure Portal

Click on the link to Data Factories. If you don’t immediately see it in your shortcuts, you can use the Azure search bar to search for it. On the Data factories window, you’ll see the list of data factories you’ve created (if any). From here, you can click the Add button to begin creating your first Azure data factory.

Azure Data Factory list

Create a New Data Factory

As shown below, the Create Data Factory screen is fairly simple. Select the subscription, resource group, and region, and give it a name – that’s it. You can also specify the data factory version, but the only real option here is V2 (trust me when I say that ADF V1 was very V1). You’ll also notice the option to Enable Managed Virtual Network, which I’ll cover in a later post.

Creating your first Azure data factory
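Behind the portal form, a data factory is just an Azure resource. As a rough sketch (the factory name and region below are placeholders, not values from this walkthrough), the equivalent ARM resource definition looks something like this:

```json
{
  "type": "Microsoft.DataFactory/factories",
  "apiVersion": "2018-06-01",
  "name": "dundermifflin-adf",
  "location": "eastus",
  "identity": {
    "type": "SystemAssigned"
  },
  "properties": {}
}
```

The system-assigned managed identity is created automatically with the factory and can later be used to authenticate to other Azure resources without storing credentials.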

After creating the new data factory, the next step is to open the editor. Shown below is the new, empty data factory, ready to be configured.

New data factory

For this example, I’m going to build a simple import that will perform the following tasks:

  • Connect to a file named EmployeeStatus.txt in an Azure blob storage account, and load that data to a table of the same name in an Azure SQL database
  • Connect to a file named Employees.txt in an Azure blob storage account, and load that data to a table of the same name in an Azure SQL database

I’ll use a single pipeline named Dunder Mifflin Employees, and will include two activities in that pipeline, one for each of the files I will import into the database.

Azure Data Factory pipeline
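If you switch to the Code view in the ADF editor, a pipeline like this is simply a JSON document. A minimal sketch of the Dunder Mifflin Employees pipeline might look like the following, with the activity bodies elided for brevity:

```json
{
  "name": "Dunder Mifflin Employees",
  "properties": {
    "activities": [
      { "name": "Copy Employee Status", "type": "Copy" },
      { "name": "Copy Employees", "type": "Copy" }
    ]
  }
}
```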

For each of these operations, I’m using the copy data activity. This is one of the simplest activities available in ADF, and allows us to grab all of the data from a source and load it all to a destination (commonly referred to as a sink in ADF). To configure the settings, click on the activity to reveal the properties window as shown below.

ADF copy data activity

Setting up Datasets and Linked Services

Most of the structured data movement activities require the creation of an underlying dataset. A dataset in ADF defines the connection and metadata for a set of structured data used as either a source or a sink. Each of the copy data activities I’m using requires two datasets – one for the source, another for the sink. You’ll notice that there’s an attention flag (the red number 1) beside Source and Sink in the properties window, indicating that the datasets have not yet been defined for either of these.

Before creating the source dataset, I’ll have to first create a connection (called a linked service in ADF) to the storage account, named DunderMifflinDataFiles. Both of the files are located in the same folder in that storage account, so I can just create a single linked service and then specify for each copy activity the file I want to use. Shown below is the configuration blade for that Azure Blob Storage linked service.

Azure Blob Storage linked service
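In Code view, that linked service is a small JSON definition. A sketch of it might look like the following (the connection string values are placeholders):

```json
{
  "name": "DunderMifflinDataFiles",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<storage-account>;AccountKey=<key>"
    }
  }
}
```

In practice, storing the account key in Azure Key Vault (or using the factory's managed identity) is preferable to embedding it in the connection string.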

After setting up the linked service to the Azure Blob Storage account, I’ll create a dataset for the first of the two files to process. For the Employee Status load, I’ve set the properties of this new dataset to use the EmployeeStatus.txt file from that storage account. For this example, the only other change I had to make was to change the delimiter from a comma to a pipe to match the format of my source data file.

ADF source file properties
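The JSON behind this dataset captures the same settings, including the pipe delimiter. A sketch, assuming an illustrative dataset name and container (neither is specified in the screenshots above):

```json
{
  "name": "EmployeeStatusText",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "DunderMifflinDataFiles",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "datafiles",
        "fileName": "EmployeeStatus.txt"
      },
      "columnDelimiter": "|",
      "firstRowAsHeader": true
    }
  }
}
```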

To complete the setup of the source for this copy data activity, I’ll set the properties of the source to use that new dataset.

ADF copy activity source

I’ll repeat the last two steps – creating the dataset and using that dataset as a source – for the second copy data activity for the Employees.txt file.

With the source linked service and datasets configured, the next step is to configure the sinks for each of those activities. I’ll follow a similar process by first creating a linked service to my Azure SQL database, followed by the configuration of the sink settings for each activity. Shown below, I’ve created the new linked service for the Azure SQL database to which the data will be loaded.

Azure SQL DB linked service
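As with the storage account, this linked service is a short JSON definition under the hood. A sketch, with an illustrative name and placeholder connection values:

```json
{
  "name": "DunderMifflinSqlDb",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=<database>;User ID=<user>;Password=<password>"
    }
  }
}
```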

The configuration of the dataset for a database table is usually very simple – just pick the table from the list.

ADF table dataset
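The resulting table dataset JSON is equally simple. A sketch, assuming the dbo schema and an illustrative dataset name:

```json
{
  "name": "EmployeeStatusTable",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": {
      "referenceName": "DunderMifflinSqlDb",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "schema": "dbo",
      "table": "EmployeeStatus"
    }
  }
}
```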

Next, set the copy data activity to use this new dataset as its sink. In this example, the data in the flat files matches column-for-column with the schema in the table, so there’s no need to modify the data mappings. However, if needed, the Mapping tab can be used to align columns in the source with their counterparts in the sink.

ADF sink
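Putting the pieces together, the fully configured copy data activity references both datasets. A sketch of what the first activity's JSON might look like (dataset names are the illustrative ones used above, and the column name in the mapping is hypothetical):

```json
{
  "name": "Copy Employee Status",
  "type": "Copy",
  "inputs": [
    { "referenceName": "EmployeeStatusText", "type": "DatasetReference" }
  ],
  "outputs": [
    { "referenceName": "EmployeeStatusTable", "type": "DatasetReference" }
  ],
  "typeProperties": {
    "source": { "type": "DelimitedTextSource" },
    "sink": { "type": "AzureSqlSink" },
    "translator": {
      "type": "TabularTranslator",
      "mappings": [
        { "source": { "name": "StatusID" }, "sink": { "name": "StatusID" } }
      ]
    }
  }
}
```

Since the files here match the tables column-for-column, the translator block could be omitted entirely and ADF would map the columns implicitly.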

Repeat the above for the sink for the other copy data activity to complete the configuration.

Add Pipeline Constraint

Once this is done, I’ll set up the constraints between these two activities. Because the EmployeeStatus table is referenced by a foreign key on the Employees table, this pipeline should constrain these operations to load EmployeeStatus followed by Employees. I’ll use an activity dependency to do this. As shown below, the arrow drawn between the two activities indicates that the one on the source side of the arrow (Copy Employee Status) must complete successfully before the one on the target side (Copy Employees) begins.

ADF Activity dependency
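In the pipeline JSON, that green arrow becomes a dependsOn element on the downstream activity. A sketch of the relevant fragment:

```json
{
  "name": "Copy Employees",
  "type": "Copy",
  "dependsOn": [
    {
      "activity": "Copy Employee Status",
      "dependencyConditions": [ "Succeeded" ]
    }
  ]
}
```

Other dependency conditions (Failed, Completed, Skipped) are available for building error-handling paths.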

Test the Data Factory

It’s time to test! I’ll use the Debug feature to test this functionality to make sure there are no errors, and that the output is what I expect. Shown below, I can run an execution through the debugger directly from the editing window.

ADF testing using debug

Publish the Data Factory

The only thing left to do is to publish the finalized changes. Clicking the Publish All button moves the underlying changes from their temporary state into the live data factory, where the pipeline can now be invoked manually, through a trigger, or via another scheduling mechanism.

Publish ADF
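For unattended execution, a schedule trigger can be attached to the published pipeline. A sketch of a daily trigger definition (the trigger name and start time are illustrative, not from this walkthrough):

```json
{
  "name": "DailyEmployeeLoad",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2020-09-01T06:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "Dunder Mifflin Employees",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```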

Conclusion

In this post, I’ve briefly shared how you can begin creating your first Azure data factory. In my next posts on this topic, I’ll dig more into complex examples, and will demonstrate how to apply ETL best practices within Azure data factory architectures.

About the Author

Tim Mitchell
Tim Mitchell is a data architect and consultant who specializes in getting rid of data pain points. Need help with data warehousing, ETL, reporting, or training? If so, contact Tim for a no-obligation 30-minute chat.

