Tuning Your SSIS Package Data Flow in the Enterprise (SQL Server Video)
This video demonstrates how to improve the performance of the data flow in an Integration Services package. In this video, you will learn how to tune the following phases of the data flow:
You can apply these performance tuning tips when you design, develop, and run the data flow.
Video Time Stamp
Hi, this is David Noor, Senior Development Lead on SQL Server Integration Services here at Microsoft. Welcome to this video on tuning your SSIS package data flow in the enterprise.
This video is the second part of a four-part series titled, SSIS: Designing and Tuning for Performance. In part one of the series, Denny took a look at how to best measure and understand the performance of SSIS packages. In this installment, we’ll build on that, and look at how to improve the performance of your SSIS package data flow. We’ll start by identifying the common themes in every data flow, and how to choose which parts of your data flow you should work on first when trying to improve performance. Once you understand where performance problems are, there’s a lot you can do in your data flow to make things run faster and more efficiently. We’ll look at a series of specific tips you can apply when designing, developing, and even executing your dataflow.
Let’s get started!
Most package data flows focus on extracting, transforming, and loading critical data, sometimes known as ETL. Tuning your data flows means tuning each of these phases of data flow: extraction, transformation, and loading.
For extraction, we’ll talk about how to tune SSIS source components and the queries they run to give you the best performance and minimum impact on your source systems.
Tuning the transformations in your data flow means looking at the work you need to accomplish, and choosing the optimal approach to accomplish that work by removing unnecessary steps, and sometimes changing the original source queries to do transformations for you.
Lastly, tuning the loading side of your data flow means tuning SSIS destination components, networks, locking, and other factors which can impede your data flow’s progress as it attempts to load data into your destination.
In this video, we’ll talk about practical tips you can use to look for tuning opportunities in your data flow in all three phases.
Before we begin looking at these tips, it’s important to remember that there’s usually no silver bullet when dealing with performance problems. No one tip here is likely to fix all of the performance issues your data flow may be having. The tips we’ll show work well in common cases, but you’re not going to want to just throw a lot of tips at your problem. Your best bet is always to understand what the intent of your data flows are, make sure they’re designed well to meet those goals, measure their current performance, and iteratively make changes to your data flow and measure whether or not your change has improved performance.
Also when you’re looking at a performance problem, it’s easy to become fixated on a particular technology or component, sometimes because of familiarity with the technology or the code. But whatever the reason, try to think of your data flow in the context of an entire ecosystem—most likely, multiple relational databases, a network, operating systems, file systems, lots of pieces. The more you know about this whole ecosystem, the more complete of a picture you’ll have as to the real performance challenges any one piece is facing, and you’ll be able to make more effective, less risky changes.
We’ll start where your data flow does—with extraction. When using SQL Server or any other database with a bulk interface as your source, you should experiment with increasing the packet size. Normally, the SQL Server default of 4096 works well, but since your extraction is going to be moving large amounts of data, bumping this up should help. For this to take effect, you’ll want to ask your network admin to enable “Jumbo Frames” on your network as well. You’ll want to test the effect of this on your package, though. If you’re using the same connection manager for bulk operations like an OLE DB source and for single-row operations (like an OLE DB command), you might consider creating a second connection manager for the OLE DB command operations, and use a lower packet size there.
As noted, if your ETL is running on Windows 2008 and you’re on a multi-core machine with multiple NICs, you can achieve a bit better network performance by affinitizing network cards to individual cores. Look for a blog post entitled, Scaling Heavy Network Traffic with Windows, on the MSDN site for more information.
One of the easiest things to look for in tuning extraction is to tune the queries you’re using. It’s tempting to sometimes just select the table you want to extract from and allow all of the columns to be retrieved, but you’ll get far better results if you select only the columns you really need. Not only will this result in less network traffic and memory used, but the database server may be able to do much less I/O to satisfy your query.
As shown here, for SQL Server, you can also use a hint to indicate that no shared locks should be issued while doing the select, so your query will read potentially uncommitted or dirty data. Reserve this for cases when you absolutely must have the best performance, and you’re sure that reading dirty data will always be okay for your ETL job.
Lookups can be thought of as extraction or transformation. But either way, you should use some of the same ideas as we showed on the previous slide. Here, it’s even more critical to select just the columns you need, not only for network optimization, but for optimization of your in-memory cache as well.
If you’re using SQL Server 2008, one of the best features for performance is the addition of the new, shared lookup cache. Sharing a lookup cache allows you to fetch your reference data once, and then re-use it across multiple lookup operations in your package, or even across packages by saving the cache to a file. If you have multiple lookup components that reference the same table, you’ll want to check out this new option as a means to greatly increase the speed of your packages. Using it can be pretty simple. Create a Cache connection manager in your package, populate the cache using the new cache transform, and then change your lookups to refer to this cache connection for their reference data.
Let’s move on to the second phase of ETL: transformation. In transformation, you’re working with data already loaded into the data flow and attempting to get it into the right shape before it’s loaded. In SSIS, every transformation appears as an object you can drag onto your design, but not all transformations are made equal. Let’s look at the three kinds of transformations that exist in SSIS:
With the basics covered, let’s look at a few other specific tips:
Let’s move on to the loading phase.
When loading into SQL Server, you have two performant options:
For all of these, using a commit size of 0 will result in the fastest load.
It’s a pretty common practice to drop indexes in your destination system as well when doing large loads, but you should use some guidelines to inform you as to when to do it. A common guideline is to choose to drop indexes based on the percent increase the load is expected to cause, and the kinds of indexes you have on the table:
If you’re loading into a table that has other activity going on it, you’ll need to devise a strategy. Bulk loads will likely lock down most if not all of the destination table. So, you’ll want to make sure that’s okay, or use partitioning. If you need to load against an operational database, you may be able to set up a partition to load into so that current operational data can stay active. If your load is going slowly, you’ll want to make sure to check out what other activity is occurring against that table or partition, and make sure nothing else is contending with it.
For an excellent guide on increasing performance of bulk loads and partitions, look for the SQLCAT article entitled, The Data Loading Performance Guide, available on MSDN.
Also when doing reloads, make sure to use TRUNCATE, not DELETE to clear out data, so that the delete won’t be transactional.
Destinations that use a network connection are also subject to the same network issues I described earlier. Look at increasing the packet size and enabling "Jumbo Frames" on your network to get optimal performance at load time.
Well, that’s it for part two of this performance series. Special thanks to Thomas for all of his helpful information that formed the base for this video series, and to Carla and Douglas for all of their help in creating this series. For more information on these topics, please refer to the Top 10 SQL Server Integration Services Best Practices.
Thank you for watching this video. Be sure to check out the other three parts of this video series titled, SSIS: Designing and Tuning for Performance. We hope that you have found this of value and we’d really like to hear what you think. On the video home page, look for the link to rate and give feedback in the upper-right hand corner.