by Matthew Roche, MCT and SQL Server MVP. You can view Matthew’s blog
at http://bi-polar23.blogspot.com/.
Introduction
SQL Server Integration Services is a high-performance
enterprise-level Extract, Transform and Load (ETL) platform for a wide variety
of business intelligence and data warehousing projects. Although Microsoft
included Data Transformation Services (DTS) with SQL Server 7.0 and SQL Server
2000, it was not until the introduction of Integration Services with SQL Server
2005 that Microsoft had a true enterprise ETL toolkit. This toolkit integrates
with software development lifecycle tools and best practices – the things that
allow ETL to be a true first-class citizen in the project. One of these tools
is integrated source code control. Because Integration Services development is
done using Business Intelligence Development Studio (BIDS) project templates
and tools in Visual Studio, Integration Services developers can use any source
control system that integrates with Visual Studio, including Microsoft Visual
SourceSafe and Team Foundation Server. Other tools include standard
project-based development, integration with MSBuild and other build tools, and
more.
These tools help enable teams of Integration Services
developers to work on the same projects at the same time, without the risk of
overwriting each other’s work, and help move ETL development into the mainstream
of a larger software development project. Many of the same tools and techniques
that can be used to automate and optimize a C# or Visual Basic .NET project
also apply to Integration Services projects – something that was never possible
in DTS.
Unfortunately, not all Integration Services features lend
themselves well to larger projects, and it is not always clear which tools to
use, and which tools to avoid. One feature that generally falls into the latter
category is the data source. Data sources in BIDS can be valuable time-savers
on demonstrations, prototypes and proofs of concept with only one developer,
and this tends to be the type of environment in which most people get their
first experience with Integration Services. Because of this, Integration Services
developers will often attempt to apply this small-scale success with data
sources to their large-scale projects. However there are disadvantages to using
data sources in a large team development solution, as we will explore in this
article.
Connection Managers: A Quick Review
Before we look at data sources, we should first review one
additional Integration Services concept: the Connection Manager. Integration
Services packages use connection managers when they reference resources outside
the package. These external resources could be log files, other packages, mail
servers and more, but most often they are the source and destination databases
between which the package is moving data. There are two primary characteristics
of connection managers that concern us in this context. First, a connection
manager is contained by the package in which it is defined. In other words, a
connection manager cannot be shared between Integration Services packages,
although it can be shared between multiple components within the same package.
Second, a connection manager has a ConnectionString property that identifies
the file, database, or other external resource that it manages.
How Data Sources Work
Despite the fact that connection managers cannot be reused
between packages, the reality is that in most ETL projects, multiple packages
have to connect to the same source and destination databases, as shown in
Figure 1.
.jpg)
Figure 1: A typical ETL scenario
In this scenario, data is loaded from multiple source
systems, such as the Enterprise Resource Planning (ERP) and Line of Business
(LOB) systems that are being used to populate a data warehouse. Data is
extracted from these source systems by a set of Integration Services packages –
often one package per table – into a common staging database. Data is then
loaded from the staging database into a data warehouse for analysis. In
addition, Integration Services projects are frequently organized around
business processes, so that there will be packages in multiple projects extracting
data from source systems into the staging database, and loading it from there
into the data warehouse.
Data sources are designed to work around the limitations
imposed by the single-package nature of connection managers. They give
Integration Services developers a mechanism to define and store a
ConnectionString outside of the package, so that it can be reused by connection
managers in multiple packages. To add a data source to a project, simply
right-click on the Data Sources node in the Solution Explorer window, select
New Data Source from the context menu that appears, and specify the connection
information in the dialog that appears. The data source is added as a .ds file
to the Data Sources node in the project. These .ds files can then be shared and
version controlled like any other file in the project.
It is worth noting that data sources are not technically an
Integration Services concept. Data sources are provided by the BIDS development
tool as a design-time aid, not by Integration Services itself. A data source is
a project-level object that references a database, and information about the
data source is stored in the .dtproj project file.
Once a data source has been added to a project, multiple
connection managers can be created in multiple packages that reference the same
data source. In this scenario, if the database connection information changes,
the Integration Services developer needs to change that information only in one
place. Once the developer updates the data source, all connection managers that
reference it will use the updated connection information when executing inside
BIDS. Figure 2 shows an example solution inside the BIDS Solution Explorer
window. In this example, each project contains the data sources required by the
packages within the project – although each package within a project is likely
to reference the same databases as other packages within the same project, the
connection information is stored in one location for ease of development.
.jpg)
Figure 2: A typical ETL solution in Solution Explorer
To create a connection manager that is based on a data
source, developers can simply right-click in the Connection Managers area of
the package designer window and select “New Connection From Data Source” as
shown in Figure 3.
.jpg)
Figure 3: Creating a new connection from a data source
When creating a connection manager in this manner, you’ll
notice two differences from creating a connection manager without a data
source. First, a different icon is displayed in the Connection Managers window.
Second, the connection manager’s DataSourceID property is assigned the name of
the data source from which the connection manager was created, as shown in
Figure 4.
.jpg)
Figure 4: Properties of a connection manager configured from a data
source
What’s happening here is that BIDS is updating the project
file to define the mapping between the connection manager and the data source.
Remember – data sources are a convenience provided by the development
environment, and not a feature of Integration Services. If you attempt to
locate a DataSourceID property in the Integration Services object model, you
will not find it.
The next time that data sources come into play is when a
package that uses the data source is opened in the BIDS designer. At that
point, BIDS checks to see whether the connection string in the data source file
matches the ConnectionString property of the connection manager, and updates
the package file with the connection string from the data source if the two are
different. Figure 5 shows the Synchronize Connection Strings dialog that BIDS
displays when the connection strings do not match.
.jpg)
Figure 5: Synchronizing connection strings based on a data source
Using Data Sources in a Multi-Project Solution
What we’ve seen so far have been ways to reuse connection
information within a project. But data sources provide a mechanism for reusing
connection information between multiple projects within a BIDS solution as
well. This functionality can be very attractive; as we saw in Figure 1, it is
common to have packages in multiple projects accessing common databases. By
enabling connection information reuse between projects, data sources allow
Integration Services developers to more easily respond to changes in their
development environment, such as if the name of the data warehouse database
were to change, or if the staging database were moved to a different
development server.
This cross-project information sharing is enabled by the
“Maintain a reference to another object in the solution” option in the Data
Source Designer dialog box, as shown in Figure 6.
.jpg)
Figure 6: Configuring a data source based on another data source
Although it is possible to have data sources in one
Integration Services project reference data sources in various other projects,
a more manageable technique is to include in your solution a “common resources”
project that contains all “master” data sources, and then have all data sources
in other projects reference these “masters.”Figure 7 shows an example of how to set up this project. Note that there
are no Integration Services packages in this project, but that other shared
resources can also be included in the common resources project.
.jpg)
Figure 7: Using a common resources project for master data sources
As we’ve seen so far, data sources provide a mechanism by
which Integration Services package developers can reuse connection information
between packages in a project, or between packages in multiple projects in a
solution. We’ve also seen that data sources are implemented as .ds files that
are part of the project, so that they can be shared between members of a
development team and versioned using source control like any other file within
the project. So what’s not to like?
Drawbacks of Using Data Sources
Drawbacks of Using Data Sources in Team Development
So far data sources sound powerful, but they do have some
serious drawbacks that make them problematic in larger team environments –
exactly the type of environments where the functionality we’ve described would
be most useful. Unfortunately, these drawbacks are founded in the very
implementation details that we’ve discussed above, and they conspire to make
data sources largely unusable in real-world team projects.
Consider a scenario where each Integration Services
developer works with his own copy of the project databases. This is often seen
with distributed teams, as performing ETL operations over WAN or VPN
connections is painfully slow with most real-world data volumes. In this
scenario, each developer in turn must update the .ds files with their own
connection information. This means that each time a developer checks out and
opens a package, after another team member has updated a data source used by
that package, the developer is presented with the dialog from Figure 5. When he
clicks OK to close the dialog and continue, the package will be validated using
the new connection string set by the other team member. Because the other
developer is using different settings that are often not valid in the current
context, validation will fail. Often this validation takes 30 seconds or more,
depending on the package and the connection settings. And the situation gets
worseif you open multiple packages at once; for example, if you have configured
BIDS to open all files that were open at the time the developer last closed
BIDS. In this situation, there is no opportunity to exit or interrupt the
process. The developer must manually click OK for each package, wait for its
validation to complete, and then move on to the next package. For large
solutions, this can be very frustrating and time consuming.
Another drawback of data sources in team scenarios comes
from how BIDS compares two data sources to see whether they match. BIDS
compares the connection strings by using simple string comparison; two
functionally identical connection strings can cause a synchronization to occur
if their name/value pairs are ordered differently. Also, because BIDS
physically changes the package (.DTSX) files when applying the new connection
information from a data source, the new connection string may even be
functionally identical to the existing one. This can result in “false
positives” on code churn reports, skewing the metrics gathered by project
management, and making meaningful historical analysis of package changes even
more difficult. This can also result in lost productivity. The DTSX files will
be checked out silently by BIDS using the default source control settings, and
the developer may not be aware that the checkout has taken place. Because of
this, he may not check the files back in, and other developers may then be
unable to perform their work without contacting the original developer to check
back in the files.
As you can see, the drawbacks of data sources become
apparent when working in a larger team environment, and it is on these projects
where the lost productivity will have the greatest impact.
Drawbacks of Using Data Sources in Package Deployment
The disadvantages of using data sources for connection reuse
do not end when package development is complete. When development is complete
and the packages are promoted from the development environment to testing, or
from the testing environment to production, the connection managers in the
packages need to have their ConnectionString properties updated to reference
the databases and other external resources in each environment. Because data
sources are a feature of the BIDS designers in Visual Studio, but Visual Studio
is not used in non-development environments, packages built to rely on data
sources have no mechanism for updating their connection managers without
editing the packages themselves. Obviously this is less than ideal; any
deployment strategy that relies on manually editing the source code (which is
essentially what DTSX files are for an Integration Services application) is
bound to be fraught with problems.
Using Package Configurations to Replace Data Sources
Fortunately, Integration Services provides an ideal
alternative to data sources: package configurations. A full discussion of
package configurations deserves an article of its wn.But it is very likely that if you work in a
team environment, you’ve already seen package configurations used to make
packages location- and environment-independent at deployment time. It is simple
to update your Integration Services projects to remove data sources and utilize
the same tools for development configuration as you already use for
post-deployment configuration. Here’s how to proceed:
1. Update the packages to use configurations to assign
values to the ConnectionString property of each connection manager. Use
indirect configurations, so that the path to the XML configuration file is
stored in an environment variable. (You can also store configuration settings
in a SQL Server database if desired, but for the purposes of this article we’ll
assume that the data sources are being replaced with indirect XML file
configurations.) This will allow you to eliminate any dependencies on
hard-coded file paths for your configuration files.
2. Create a “common” set of configuration files for the
development solution, and an additional set of configuration files for each
developer whose configuration is different from the default common
configuration, and for each environment into which the packages will be deployed.
Figure 8 shows the BIDS Solution Explorer window with this change in place. The
value of having multiple sets of configuration files is that all tested or
supported configurations are all in a common location that is under source
control and available to all team members.
.jpg)
Figure 8: Setting up configurations in a team solution
3. Create a batch file for each set of configuration files
to set the environment variables used by the indirect configurations. Figure 9
shows an example Registry script that uses the SETX command line utility to
create persistent environment variables. (The SETX utility is included with
Windows Server 2003 and later, but is not included with Windows XP. A
downloadable version is available for Windows XP, but the syntax is slightly different
from the version included with more recent versions of Windows.) The value of
having multiple batch files is that there will be a single file to run to set
up the environment variables required for a given deployment or development
environment, and these files are also stored in a common shared location.
.jpg)
Figure 9: Using the SETX utility to set up environment variables
4. Delete the data sources’ .ds files from each project.
This will automatically remove the DataSourceID property from each connection
manager in each package in the solution.
This technique can easily be adjusted to match your specific
needs. Indirect configurations work just as well for SQL Server configurations
as they do for XML file configurations; only the information stored in the
environment variable needs to change. Also, you can have a different set of
configuration files for each deployment environment, rather than for each
developer, if that makes more sense for your particular scenario. How you
structure things is up to you. The important thing is that the same package
configurations are used – with the appropriate values – in both development and
deployment.
A Batch Package Update Utility
One challenge is how to update all of your packages to use
configurations instead of data sources. If you’re already using configurations
for all of your connection managers, this is relatively simple. If you are not,
the task of manually updating every package in a large solution can be
daunting. Fortunately, Integration Services provides a set of .NET assemblies
that expose an object model for programmatically executing, creating and
modifying packages. It’s relatively simple to use this object model to automate
the task at hand.
In addition, there is an existing project on CodePlex called
“PacMan” – the SSIS Package Manager – that can provide a head start to complete
this task. (You can view and download Pacman from http://www.codeplex.com/pacman.) The
PacMan user interface is shown in Figure 10.
.jpg)
Figure 10: The user interface of PacMan, the SSIS Package Manager,
a Codeplex project
Although PacMan has some complete features built in, its
greater value comes from the framework that it provides for performing batch
updates to sets of packages. PacMan is a rough development utility designed to
be enhanced through code, more than it is intended to be used as-is. Figure 11
shows the projects and classes that make up the PacMan solution in Visual
Studio.
.jpg)
Figure 11: The PacMan solution in Visual Studio
The Components project consists of a set of classes that
wrap and hide the complexity of working with the Integration Services object
model, or for working with solutions, projects or sets of packages. The UI
project contains a single Windows form that allows the user to select packages,
and update them in batch mode to add indirect XML configurations, and to remove
data source bindings from the projects, and to perform other batch updates. The
PackageCollectionUtil class in the Components project encapsulates a set of
Integration Services packages. When the user selects a project, solution or
folder tree through the PacMan user interface, a form-level instance of this
class is populated with those packages found within the selected container; this
collection can then be used anywhere within the PackageManager form.
For example, to add an identical indirect XML configuration
to each package in the selected collection, use the following code:
private void AddConnectionManagerConfiguration(string connectionManagerName,
string environmentVariableName)
{
if (packages != null)
{
// The root folder for the packages collection is a "reasonable
// default" for storing the XML config file
string configFilePath = Path.Combine(packages.RootCollectionPath,
connectionManagerName + ".dtsConfig");
// Create an environment variable that references the XML config file
Environment.SetEnvironmentVariable(environmentVariableName,
configFilePath, EnvironmentVariableTarget.Machine);
foreach (PackageUtil pu in packages)
{
Package p = pu.SsisPackage;
bool isFirstPackage = true;
// Update each package in the packages collection
if (p.Connections.Contains(connectionManagerName))
{
if (isFirstPackage)
{
// Save the configuration to an XML file
SaveXmlConfigFile(configFilePath, p, connectionManagerName);
isFirstPackage = false;
}
// Ensure that configurations are enabled
p.EnableConfigurations = true;
string configurationName = connectionManagerName + "_Configuration";
// Remove any existing configuration with the same name
if (p.Configurations.Contains(configurationName))
{
p.Configurations.Remove(configurationName);
}
// Create a new indirect XML configuration
Configuration c = p.Configurations.Add();
c.Name = configurationName;
c.ConfigurationType = DTSConfigurationType.IConfigFile;
c.ConfigurationString = environmentVariableName;
}
}
packages.Save();
}
}
private void SaveXmlConfigFile(string configFilePath,
Package sourcePackage, string connectionManagerName)
{
XmlTextWriter xw = new XmlTextWriter(configFilePath, Encoding.UTF8);
xw.Formatting = Formatting.Indented;
xw.WriteStartDocument();
xw.WriteStartElement("DTSConfiguration");
xw.WriteStartElement("DTSConfigurationHeading");
xw.WriteStartElement("DTSConfigurationFileInfo");
xw.WriteAttributeString("GeneratedBy", sourcePackage.CreatorName);
xw.WriteAttributeString("GeneratedFromPackageName", sourcePackage.Name);
xw.WriteAttributeString("GeneratedFromPackageID", sourcePackage.ID);
xw.WriteAttributeString("GeneratedDate", DateTime.Now.ToString("G"));
xw.WriteAttributeString("LastModifiedDate", DateTime.Now.ToShortDateString());
xw.WriteAttributeString("Description", sourcePackage.Description);
xw.WriteEndElement(); // Close the DTSConfigurationFileInfo element
xw.WriteEndElement(); // Close the DTSConfigurationHeading element
xw.WriteStartElement("Configuration");
xw.WriteAttributeString("ConfiguredType", "Property");
xw.WriteAttributeString("Path", string.Format(
"\\Package.Connections[{0}].Properties[ConnectionString]",
connectionManagerName));
xw.WriteAttributeString("ValueType", "String");
xw.WriteElementString("ConfiguredValue",
sourcePackage.Connections[connectionManagerName].ConnectionString);
xw.WriteEndElement(); // Close the Configuration element
xw.WriteEndElement(); // Close the DTSConfiguration element
xw.Flush();
xw.Close();
}
This sample code performs a set of tasks to update the
selected packages to use indirect XML configurations, including:
- Create an XML configuration file using the
ConnectionString property for the selected connection manager. The
ConnectionString value (as well as other values used in the config file) is
taken from the first package in the packages collection. As all packages with a
common connection manager are likely to use the same ConnectionString, this is
a simplistic but generally successful approach.
- Create a machine environment variable that
contains the path to the XML configuration file.
- Ensure that configurations are enabled for each
package in the packages collection.
- Add a new indirect XML configuration to each
package in the packages collection, referencing the environment variable to
locate the XML configuration file.
- Save each package in the packages collection.
Once this batch update is completed, the hard work is done.
All that remains to be done is to remove the data sources from the packages.
To remove the data sources from all packages in a given
project, the project file, not the packages, must be modified – remember that
the data sources are defined in the project file itself. The PacMan Components
project includes a DtProjHelper class, which implements a DeleteDataSources
method that deletes the DataSources elements from the project file:
public static void DeleteDataSources(FileInfo projectFile)
{
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load(projectFile.FullName);
XmlElement projectNode = xmlDoc.DocumentElement;
XmlNode dataSourcesNode = projectNode.SelectSingleNode
("DataSources");
dataSourcesNode.RemoveAll();
xmlDoc.Save(projectFile.FullName);
}
Although the Integration Services .NET API provides all of
the core functionality necessary to update packages to add package
configurations, the PacMan utility provides a valuable head start for batch
updating multiple packages. You can download the source code for the PacMan
utility from the CodePlex web site, and modify it to serve your own needs. The
techniques it enables (as demonstrated in the samples above) can be used to
solve a wide range of batch package update problems, far beyond data sources
and configurations.
Conclusion
In summary, Integration Services is an enterprise ETL
platform more than ready for large projects and large project teams. Data
sources are useful for smaller projects, but quickly fall short when used in
this larger context in which Integration Services excels. Replacing data
sources with package configurations gives Integration Services developers a
reliable team-ready tool for managing different development, test and
production environments. By using the Integration Services .NET object model –
perhaps by using the PacMan utility – large numbers of packages can be updated
reliably in batch mode without introducing the human errors associated with
manually updating each package. This approach provides a simple update path for
projects that are currently using data sources.
About the author. Microsoft
SQL Server MVP Matthew Roche is a
Data Architect with systems management innovator Configuresoft, Inc., and is
also Chief Software Architect of Integral Thought & Memory, a BI-focused
consultancy. Matthew has worked with Microsoft SQL Server since the mid 90s,
starting with SQL Server 6.0 and 6.5, and has been falling deeper and deeper in
love with SQL Server with each passing year. As a Microsoft Certified Trainer,
Matthew has taught hundreds of database administrators and database developers
how to better utilize SQL Server’s capabilities. As a software developer and
architect, Matthew has built dozens of applications that use the SQL Server
platform. Matthew has been using SQL Server 2005 since its early beta days and
has been working with SQL Server 2008 since early 2007. Matthew’s current focus
is on the SQL Server Business Intelligence stack, specifically SQL Server
Integration Services. When not delivering training, presenting on BI topics at
conferences, or listening to the world’s loudest heavy metal, Matthew is hard
at work helping design and build Configuresoft’s Configuration Intelligence
Analytics (CIA) platform for bringing the power of BI to enterprise IT.