Article
08/05/2008

Reusing Connections with Data Sources and Configurations

by Matthew Roche, MCT and SQL Server MVP. You can view Matthew’s blog at http://bi-polar23.blogspot.com/.

Introduction

SQL Server Integration Services is a high-performance enterprise-level Extract, Transform and Load (ETL) platform for a wide variety of business intelligence and data warehousing projects. Although Microsoft included Data Transformation Services (DTS) with SQL Server 7.0 and SQL Server 2000, it was not until the introduction of Integration Services with SQL Server 2005 that Microsoft had a true enterprise ETL toolkit. This toolkit integrates with software development lifecycle tools and best practices – the things that allow ETL to be a true first-class citizen in the project. One of these tools is integrated source code control. Because Integration Services development is done using Business Intelligence Development Studio (BIDS) project templates and tools in Visual Studio, Integration Services developers can use any source control system that integrates with Visual Studio, including Microsoft Visual SourceSafe and Team Foundation Server. Other tools include standard project-based development, integration with MSBuild and other build tools, and more.

These tools help enable teams of Integration Services developers to work on the same projects at the same time, without the risk of overwriting each other’s work, and help move ETL development into the mainstream of a larger software development project. Many of the same tools and techniques that can be used to automate and optimize a C# or Visual Basic .NET project also apply to Integration Services projects – something that was never possible in DTS.

Unfortunately, not all Integration Services features lend themselves well to larger projects, and it is not always clear which tools to use, and which tools to avoid. One feature that generally falls into the latter category is the data source. Data sources in BIDS can be valuable time-savers on demonstrations, prototypes and proofs of concept with only one developer, and this tends to be the type of environment in which most people get their first experience with Integration Services. Because of this, Integration Services developers will often attempt to apply this small-scale success with data sources to their large-scale projects. However there are disadvantages to using data sources in a large team development solution, as we will explore in this article.

Connection Managers: A Quick Review

Before we look at data sources, we should first review one additional Integration Services concept: the Connection Manager. Integration Services packages use connection managers when they reference resources outside the package. These external resources could be log files, other packages, mail servers and more, but most often they are the source and destination databases between which the package is moving data. There are two primary characteristics of connection managers that concern us in this context. First, a connection manager is contained by the package in which it is defined. In other words, a connection manager cannot be shared between Integration Services packages, although it can be shared between multiple components within the same package. Second, a connection manager has a ConnectionString property that identifies the file, database, or other external resource that it manages.

How Data Sources Work

Despite the fact that connection managers cannot be reused between packages, the reality is that in most ETL projects, multiple packages have to connect to the same source and destination databases, as shown in Figure 1.

Figure 1: A typical ETL scenario

In this scenario, data is loaded from multiple source systems, such as the Enterprise Resource Planning (ERP) and Line of Business (LOB) systems that are being used to populate a data warehouse. Data is extracted from these source systems by a set of Integration Services packages – often one package per table – into a common staging database. Data is then loaded from the staging database into a data warehouse for analysis. In addition, Integration Services projects are frequently organized around business processes, so that there will be packages in multiple projects extracting data from source systems into the staging database, and loading it from there into the data warehouse.

Data sources are designed to work around the limitations imposed by the single-package nature of connection managers. They give Integration Services developers a mechanism to define and store a ConnectionString outside of the package, so that it can be reused by connection managers in multiple packages. To add a data source to a project, simply right-click on the Data Sources node in the Solution Explorer window, select New Data Source from the context menu that appears, and specify the connection information in the dialog that appears. The data source is added as a .ds file to the Data Sources node in the project. These .ds files can then be shared and version controlled like any other file in the project.

It is worth noting that data sources are not technically an Integration Services concept. Data sources are provided by the BIDS development tool as a design-time aid, not by Integration Services itself. A data source is a project-level object that references a database, and information about the data source is stored in the .dtproj project file.

Once a data source has been added to a project, multiple connection managers can be created in multiple packages that reference the same data source. In this scenario, if the database connection information changes, the Integration Services developer needs to change that information only in one place. Once the developer updates the data source, all connection managers that reference it will use the updated connection information when executing inside BIDS. Figure 2 shows an example solution inside the BIDS Solution Explorer window. In this example, each project contains the data sources required by the packages within the project – although each package within a project is likely to reference the same databases as other packages within the same project, the connection information is stored in one location for ease of development.

Figure 2: A typical ETL solution in Solution Explorer

To create a connection manager that is based on a data source, developers can simply right-click in the Connection Managers area of the package designer window and select “New Connection From Data Source” as shown in Figure 3.

Figure 3: Creating a new connection from a data source

When creating a connection manager in this manner, you’ll notice two differences from creating a connection manager without a data source. First, a different icon is displayed in the Connection Managers window. Second, the connection manager’s DataSourceID property is assigned the name of the data source from which the connection manager was created, as shown in Figure 4.

Figure 4: Properties of a connection manager configured from a data source

What’s happening here is that BIDS is updating the project file to define the mapping between the connection manager and the data source. Remember – data sources are a convenience provided by the development environment, and not a feature of Integration Services. If you attempt to locate a DataSourceID property in the Integration Services object model, you will not find it.

The next time that data sources come into play is when a package that uses the data source is opened in the BIDS designer. At that point, BIDS checks to see whether the connection string in the data source file matches the ConnectionString property of the connection manager, and updates the package file with the connection string from the data source if the two are different. Figure 5 shows the Synchronize Connection Strings dialog that BIDS displays when the connection strings do not match.

Figure 5: Synchronizing connection strings based on a data source

Using Data Sources in a Multi-Project Solution

What we’ve seen so far have been ways to reuse connection information within a project. But data sources provide a mechanism for reusing connection information between multiple projects within a BIDS solution as well. This functionality can be very attractive; as we saw in Figure 1, it is common to have packages in multiple projects accessing common databases. By enabling connection information reuse between projects, data sources allow Integration Services developers to more easily respond to changes in their development environment, such as if the name of the data warehouse database were to change, or if the staging database were moved to a different development server.

This cross-project information sharing is enabled by the “Maintain a reference to another object in the solution” option in the Data Source Designer dialog box, as shown in Figure 6.

Figure 6: Configuring a data source based on another data source

Although it is possible to have data sources in one Integration Services project reference data sources in various other projects, a more manageable technique is to include in your solution a “common resources” project that contains all “master” data sources, and then have all data sources in other projects reference these “masters.”Figure 7 shows an example of how to set up this project. Note that there are no Integration Services packages in this project, but that other shared resources can also be included in the common resources project.

Figure 7: Using a common resources project for master data sources

As we’ve seen so far, data sources provide a mechanism by which Integration Services package developers can reuse connection information between packages in a project, or between packages in multiple projects in a solution. We’ve also seen that data sources are implemented as .ds files that are part of the project, so that they can be shared between members of a development team and versioned using source control like any other file within the project. So what’s not to like?

Drawbacks of Using Data Sources

Drawbacks of Using Data Sources in Team Development

So far data sources sound powerful, but they do have some serious drawbacks that make them problematic in larger team environments – exactly the type of environments where the functionality we’ve described would be most useful. Unfortunately, these drawbacks are founded in the very implementation details that we’ve discussed above, and they conspire to make data sources largely unusable in real-world team projects.

Consider a scenario where each Integration Services developer works with his own copy of the project databases. This is often seen with distributed teams, as performing ETL operations over WAN or VPN connections is painfully slow with most real-world data volumes. In this scenario, each developer in turn must update the .ds files with their own connection information. This means that each time a developer checks out and opens a package, after another team member has updated a data source used by that package, the developer is presented with the dialog from Figure 5. When he clicks OK to close the dialog and continue, the package will be validated using the new connection string set by the other team member. Because the other developer is using different settings that are often not valid in the current context, validation will fail. Often this validation takes 30 seconds or more, depending on the package and the connection settings. And the situation gets worseif you open multiple packages at once; for example, if you have configured BIDS to open all files that were open at the time the developer last closed BIDS. In this situation, there is no opportunity to exit or interrupt the process. The developer must manually click OK for each package, wait for its validation to complete, and then move on to the next package. For large solutions, this can be very frustrating and time consuming.

Another drawback of data sources in team scenarios comes from how BIDS compares two data sources to see whether they match. BIDS compares the connection strings by using simple string comparison; two functionally identical connection strings can cause a synchronization to occur if their name/value pairs are ordered differently. Also, because BIDS physically changes the package (.DTSX) files when applying the new connection information from a data source, the new connection string may even be functionally identical to the existing one. This can result in “false positives” on code churn reports, skewing the metrics gathered by project management, and making meaningful historical analysis of package changes even more difficult. This can also result in lost productivity. The DTSX files will be checked out silently by BIDS using the default source control settings, and the developer may not be aware that the checkout has taken place. Because of this, he may not check the files back in, and other developers may then be unable to perform their work without contacting the original developer to check back in the files.

As you can see, the drawbacks of data sources become apparent when working in a larger team environment, and it is on these projects where the lost productivity will have the greatest impact.

Drawbacks of Using Data Sources in Package Deployment

The disadvantages of using data sources for connection reuse do not end when package development is complete. When development is complete and the packages are promoted from the development environment to testing, or from the testing environment to production, the connection managers in the packages need to have their ConnectionString properties updated to reference the databases and other external resources in each environment. Because data sources are a feature of the BIDS designers in Visual Studio, but Visual Studio is not used in non-development environments, packages built to rely on data sources have no mechanism for updating their connection managers without editing the packages themselves. Obviously this is less than ideal; any deployment strategy that relies on manually editing the source code (which is essentially what DTSX files are for an Integration Services application) is bound to be fraught with problems.

Using Package Configurations to Replace Data Sources

Fortunately, Integration Services provides an ideal alternative to data sources: package configurations. A full discussion of package configurations deserves an article of its wn.But it is very likely that if you work in a team environment, you’ve already seen package configurations used to make packages location- and environment-independent at deployment time. It is simple to update your Integration Services projects to remove data sources and utilize the same tools for development configuration as you already use for post-deployment configuration. Here’s how to proceed:

1. Update the packages to use configurations to assign values to the ConnectionString property of each connection manager. Use indirect configurations, so that the path to the XML configuration file is stored in an environment variable. (You can also store configuration settings in a SQL Server database if desired, but for the purposes of this article we’ll assume that the data sources are being replaced with indirect XML file configurations.) This will allow you to eliminate any dependencies on hard-coded file paths for your configuration files.

2. Create a “common” set of configuration files for the development solution, and an additional set of configuration files for each developer whose configuration is different from the default common configuration, and for each environment into which the packages will be deployed. Figure 8 shows the BIDS Solution Explorer window with this change in place. The value of having multiple sets of configuration files is that all tested or supported configurations are all in a common location that is under source control and available to all team members.

Figure 8: Setting up configurations in a team solution

3. Create a batch file for each set of configuration files to set the environment variables used by the indirect configurations. Figure 9 shows an example Registry script that uses the SETX command line utility to create persistent environment variables. (The SETX utility is included with Windows Server 2003 and later, but is not included with Windows XP. A downloadable version is available for Windows XP, but the syntax is slightly different from the version included with more recent versions of Windows.) The value of having multiple batch files is that there will be a single file to run to set up the environment variables required for a given deployment or development environment, and these files are also stored in a common shared location.

Figure 9: Using the SETX utility to set up environment variables

4. Delete the data sources’ .ds files from each project. This will automatically remove the DataSourceID property from each connection manager in each package in the solution.

This technique can easily be adjusted to match your specific needs. Indirect configurations work just as well for SQL Server configurations as they do for XML file configurations; only the information stored in the environment variable needs to change. Also, you can have a different set of configuration files for each deployment environment, rather than for each developer, if that makes more sense for your particular scenario. How you structure things is up to you. The important thing is that the same package configurations are used – with the appropriate values – in both development and deployment.

A Batch Package Update Utility

One challenge is how to update all of your packages to use configurations instead of data sources. If you’re already using configurations for all of your connection managers, this is relatively simple. If you are not, the task of manually updating every package in a large solution can be daunting. Fortunately, Integration Services provides a set of .NET assemblies that expose an object model for programmatically executing, creating and modifying packages. It’s relatively simple to use this object model to automate the task at hand.

In addition, there is an existing project on CodePlex called “PacMan” – the SSIS Package Manager – that can provide a head start to complete this task. (You can view and download Pacman from https://www.codeplex.com/pacman.) The PacMan user interface is shown in Figure 10.

Figure 10: The user interface of PacMan, the SSIS Package Manager, a Codeplex project

Although PacMan has some complete features built in, its greater value comes from the framework that it provides for performing batch updates to sets of packages. PacMan is a rough development utility designed to be enhanced through code, more than it is intended to be used as-is. Figure 11 shows the projects and classes that make up the PacMan solution in Visual Studio.

Figure 11: The PacMan solution in Visual Studio

The Components project consists of a set of classes that wrap and hide the complexity of working with the Integration Services object model, or for working with solutions, projects or sets of packages. The UI project contains a single Windows form that allows the user to select packages, and update them in batch mode to add indirect XML configurations, and to remove data source bindings from the projects, and to perform other batch updates. The PackageCollectionUtil class in the Components project encapsulates a set of Integration Services packages. When the user selects a project, solution or folder tree through the PacMan user interface, a form-level instance of this class is populated with those packages found within the selected container; this collection can then be used anywhere within the PackageManager form.

For example, to add an identical indirect XML configuration to each package in the selected collection, use the following code:

private void AddConnectionManagerConfiguration(string connectionManagerName,
    string environmentVariableName)
{
    if (packages != null)
    {
        // The root folder for the packages collection is a "reasonable
        // default" for storing the XML config file
        string configFilePath = Path.Combine(packages.RootCollectionPath,
            connectionManagerName + ".dtsConfig");
        // Create an environment variable that references the XML config file
        Environment.SetEnvironmentVariable(environmentVariableName,
            configFilePath, EnvironmentVariableTarget.Machine);
        foreach (PackageUtil pu in packages)
        {
            Package p = pu.SsisPackage;
            bool isFirstPackage = true;
            // Update each package in the packages collection
            if (p.Connections.Contains(connectionManagerName))
            {
                if (isFirstPackage)
                {
                    // Save the configuration to an XML file
                    SaveXmlConfigFile(configFilePath, p, connectionManagerName);
                    isFirstPackage = false;
                }
                // Ensure that configurations are enabled
                p.EnableConfigurations = true;
                string configurationName = connectionManagerName + "_Configuration";
                // Remove any existing configuration with the same name
                if (p.Configurations.Contains(configurationName))
                {
                    p.Configurations.Remove(configurationName);
                }
                // Create a new indirect XML configuration
                Configuration c = p.Configurations.Add();
                c.Name = configurationName;
                c.ConfigurationType = DTSConfigurationType.IConfigFile;
                c.ConfigurationString = environmentVariableName;
            }
        }
        packages.Save();
    }
}
private void SaveXmlConfigFile(string configFilePath,
    Package sourcePackage, string connectionManagerName)
{
    XmlTextWriter xw = new XmlTextWriter(configFilePath, Encoding.UTF8);
    xw.Formatting = Formatting.Indented;
    xw.WriteStartDocument();
    xw.WriteStartElement("DTSConfiguration");
    xw.WriteStartElement("DTSConfigurationHeading");
    xw.WriteStartElement("DTSConfigurationFileInfo");
    xw.WriteAttributeString("GeneratedBy", sourcePackage.CreatorName);
    xw.WriteAttributeString("GeneratedFromPackageName", sourcePackage.Name);
    xw.WriteAttributeString("GeneratedFromPackageID", sourcePackage.ID);
    xw.WriteAttributeString("GeneratedDate", DateTime.Now.ToString("G"));
    xw.WriteAttributeString("LastModifiedDate", DateTime.Now.ToShortDateString());
    xw.WriteAttributeString("Description", sourcePackage.Description);
    xw.WriteEndElement(); // Close the DTSConfigurationFileInfo element
    xw.WriteEndElement(); // Close the DTSConfigurationHeading element
    xw.WriteStartElement("Configuration");
    xw.WriteAttributeString("ConfiguredType", "Property");
    xw.WriteAttributeString("Path", string.Format(
        "\\Package.Connections[{0}].Properties[ConnectionString]",
        connectionManagerName));
    xw.WriteAttributeString("ValueType", "String");
    xw.WriteElementString("ConfiguredValue",
        sourcePackage.Connections[connectionManagerName].ConnectionString);
    xw.WriteEndElement(); // Close the Configuration element
    xw.WriteEndElement(); // Close the DTSConfiguration element
    xw.Flush();
    xw.Close();
}

This sample code performs a set of tasks to update the selected packages to use indirect XML configurations, including:

Create an XML configuration file using the ConnectionString property for the selected connection manager. The ConnectionString value (as well as other values used in the config file) is taken from the first package in the packages collection. As all packages with a common connection manager are likely to use the same ConnectionString, this is a simplistic but generally successful approach.
Create a machine environment variable that contains the path to the XML configuration file.
Ensure that configurations are enabled for each package in the packages collection.
Add a new indirect XML configuration to each package in the packages collection, referencing the environment variable to locate the XML configuration file.
Save each package in the packages collection.

Once this batch update is completed, the hard work is done. All that remains to be done is to remove the data sources from the packages.

To remove the data sources from all packages in a given project, the project file, not the packages, must be modified – remember that the data sources are defined in the project file itself. The PacMan Components project includes a DtProjHelper class, which implements a DeleteDataSources method that deletes the DataSources elements from the project file:

public static void DeleteDataSources(FileInfo projectFile)
{
    XmlDocument xmlDoc = new XmlDocument();
    xmlDoc.Load(projectFile.FullName);
    XmlElement projectNode = xmlDoc.DocumentElement;
    XmlNode dataSourcesNode = projectNode.SelectSingleNode
        ("DataSources");
    dataSourcesNode.RemoveAll();
    xmlDoc.Save(projectFile.FullName);
}

Although the Integration Services .NET API provides all of the core functionality necessary to update packages to add package configurations, the PacMan utility provides a valuable head start for batch updating multiple packages. You can download the source code for the PacMan utility from the CodePlex web site, and modify it to serve your own needs. The techniques it enables (as demonstrated in the samples above) can be used to solve a wide range of batch package update problems, far beyond data sources and configurations.

Conclusion

In summary, Integration Services is an enterprise ETL platform more than ready for large projects and large project teams. Data sources are useful for smaller projects, but quickly fall short when used in this larger context in which Integration Services excels. Replacing data sources with package configurations gives Integration Services developers a reliable team-ready tool for managing different development, test and production environments. By using the Integration Services .NET object model – perhaps by using the PacMan utility – large numbers of packages can be updated reliably in batch mode without introducing the human errors associated with manually updating each package. This approach provides a simple update path for projects that are currently using data sources.

About the author**. Microsoft SQL Server MVP Matthew Roche is a Data Architect with systems management innovator Configuresoft, Inc., and is also Chief Software Architect of Integral Thought & Memory, a BI-focused consultancy. Matthew has worked with Microsoft SQL Server since the mid 90s, starting with SQL Server 6.0 and 6.5, and has been falling deeper and deeper in love with SQL Server with each passing year. As a Microsoft Certified Trainer, Matthew has taught hundreds of database administrators and database developers how to better utilize SQL Server’s capabilities. As a software developer and architect, Matthew has built dozens of applications that use the SQL Server platform. Matthew has been using SQL Server 2005 since its early beta days and has been working with SQL Server 2008 since early 2007. Matthew’s current focus is on the SQL Server Business Intelligence stack, specifically SQL Server Integration Services. When not delivering training, presenting on BI topics at conferences, or listening to the world’s loudest heavy metal, Matthew is hard at work helping design and build Configuresoft’s Configuration Intelligence Analytics (CIA) platform for bringing the power of BI to enterprise IT.