When designing the ETL architecture for new or changing systems, one of the more common needs is to create a process that copies all of the data from a set of tables (perhaps even all tables) from a source system to a destination database. This is especially prevalent when building an operation data store (ODS), or building a set of staging tables from OLTP data. Creating ETL processes for these operations can be a time-consuming operation, even when each ETL pipeline is very simple. If you’re using SSIS, however, it may be possible to eliminate most if not all of the tedious work in staging this data by building the staging load with Biml.
In this post, I’ll walk you through creating a simple staging load with Biml. In fact, all of the logic for this process can be done in less than 100 lines of code (download here).
What Is Biml?
Biml is an acronym for Business Intelligence Markup Language. It is a language designed to automate the creation of SSIS packages. Biml is an XML-based language, and can be wrapped with either C# or VB.net; the combination of Biml and C# or VB.net is usually referred to as BimlScript. While Biml alone can be useful, BimlScript is an incredibly powerful tool for building dozens or even hundreds of packages with a very small amount of code.
To use Biml, you’ll need to download BimlExpress, a free SSDT add-in. You can also use Mist, a commercial product that adds a graphical development environment, better project and file management, and support for SSAS cube design, among many other benefits. For the example in this post, I’ll be using the free BimlExpress environment.
Create a Staging Load with Biml
For the source of this data load, I’ll be using a copy of the AdventureWorks database. For the destination database, I’ll create a new blank database named AdventureWorksODS. Since I’m starting off with a blank destination database, the logic I build will need to include the necessary T-SQL to create the staging tables. To keep things properly separated, I’ll build two separate projects in SSDT: one for the packages to create the destination tables, and the other to handle the movement of the data from source to destination.
Create the DDL Packages for Table Creation
The first part of this effort will be to build the packages that will create the staging tables in the destination database. I’ll start this by building a Biml file named Connections.biml, which is where I’ll create the definitions for my data connections. Note that I have a connection to the source AdventureWorks database as well as the new AdventureWorksODS database.
Next, I’ll create a file named CreateTables.biml, in which I will write Biml code to build one package per table. Each of these packages will contain an SQL query that will create the new staging table in the destination database. To accomplish this, I’ll need a list of tables to process, which I can get using the GetDatabaseSchema() function in Biml to interrogate my AdventureWorks source database (using the connection created in the previous step). [Edit: Removed an unnecessary <Tables> declaration in the code below. Thanks to my friend Ben Weissman for catching that error!]
As shown above, I’m using the optional parameters for GetDatabaseSchema() to include only those tables in the [Sales] schema. I’m also excluding items such as keys, identity values, and indexes, since I’d likely need to modify those in the staging database to meet the needs of the workload on those new tables.
Once this is done, I’ll create my <Packages> collection, and inside that, set up a loop using C# BimlScript that will loop through each table name retrieved from AdventureWorks in the previous step.
Inside the loop, I create a new <Package> object in Biml, which will build a new package for each table in the collection. As shown below, this logic creates a dynamic name for the package and the ExecuteSQL task based on the name of the table in the source. I’m using the Biml function GetTableSql() to generate the CREATE TABLE statement for each table, which uses the column metadata retrieved earlier from the GetDatabaseSchema() output.
I could have stopped there, and would have had a collection of packages to execute individually to create those new staging tables. However, I prefer to automate! I’ll borrow from the logic above, and create a controller package that will execute each of these packages. By using the same BimlScript loop logic from earlier, I’ll iterate through the table list, this time creating one Execute Package Task inside a new controller package.
After saving the Connections.biml and CreateTables.biml files above, I’ll select both of them in the Visual Studio Solution Explorer, right-click, and choose Generate SSIS Packages. In doing so, the BimlExpress engine generates 20 new package – one for each of the 19 tables in the [Sales] schema, plus the controller package.
In each of these worker packages resides a single Execute SQL Task with a DROP/CREATE TABLE query.
Executing the controller package in the above project will invoke each of these DDL packages. Although this part of the solution is likely to be executed just once, the fact that we can use Biml to automate this logic can save a lot of manual work in prepping the staging database.
Creating the Load Packages
After executing the above controller package to create the staging tables, the next step is to create the actual load packages to move the data from source to staging tables. I’m using a similar design as before, in which I’ll create one worker package per table, along with a controller package to orchestrate the whole load. Each of these worker packages will contain two tasks: an Execute SQL Task to truncate the staging table, and a Data Flow Task to copy the data from source to stage.
Also worth noting is that this logic will reside in a separate project. Although I could have used a single SSDT project for the DDL and the DML operations, these tasks are very different in terms of scope and frequency. I’m a fan of keeping the work surface small, so I’m separating these into separate projects. I’ll create the projects in the same Visual Studio solution, however, since that structure is only for code organization and has no impact on execution or deployment.
In the same manner that I created the Connections.biml file in the earlier project, I’ll do the same thing here. I’m also using the same call to GetDatabaseSchema(), and the exact same BimlScript loop to iterate through each table to create one per table. In that package declaration, I am adding the attribute ConstraintMode=”Linear” to ensure that the truncate statement occurs before the data flow operation.
Inside that new package, I’ll create an Execute SQL Task to truncate the staging table.
Next, I’ll create the Data Flow Task. Inside there, I’m including the OleDB source and OleDB destination to get data from the source and send it to the staging table. In each of those transformations, I include the name and schema of the source and destination table using a BimlScript snippet.
Note that I’m not doing any data mappings at all, instead relying on the automatic data mapping feature of Biml. I can get away with this because I know the source metadata matches the destination metadata precisely. If there were a need to map columns of different names from source to destination, this part of the code would be more complicated. I’ll cover handling data mappings like that in a future post.
Finally, I’ll borrow the logic from the previous project to create the controller package for this one as well.
After selecting both the LoadStagingTables.biml file and the Connections.biml files and generating the SSIS packages, I finish with another 20 packages, one for each table and the controller package. Each one of these has an Execute SQL Task and Data Flow Task, the latter of which loads all of the rows from the source to the new staging table.
Other Design Considerations
This is a relatively simple design, and assumes that we’re simply loading all of the data from each source table to the destination table. Depending on how complex your ETL needs are, you may need to add one or more of the following items to the processing logic:
- Adding transformations in the data flow
- Order of precedence, in case some tables need to be loaded before others (to honor foreign key constraints, for example)
- Handle incremental loads rather than a full load
- Mapping source columns to differently-named columns in the destination table
- Using parallelism to load multiple tables at once
The good news is that all of these things are possible with Biml!
In this post, I’ve demonstrated how to build a staging load with Biml that will create staging tables in a destination database and build the load processes for each of those tables. Creating these packages manually would have taken hours, possibly even days or weeks if there were many tables. In this case, it took a little less than 100 lines of BimlScript code to do the same thing. If you want to test this out for yourself, you can get the full Biml files here.
Nice post .. i worked on BIML and i even generated the full integration flow from file mapping to create staging DB table till generatiing a SQL stored procedure.. using NET Framework ( SSIS : ETL flow + SQL stored procedure) , i think this approach is the best in order to be abale to do the data clean up and validate everything with the destination DB.
Please check my blog : http://microsoftbicoach.blogspot.co.id/