How to: Use the Data Profiling Task (SQL Server Video)

Applies to: SQL Server 2008 Integration Services

Authors: Douglas Laudenschlager, Microsoft Corporation

Length: 00:10:12

Size: 8.96 MB

Type: WMV file

To watch, share, and download this video in multiple formats, see Watch this Video on TechNet.

Related help topics:

Data Profiling Task

Profiling Data with the Data Profiling Task and Viewer

Video Summary

Learn how to use this powerful new task in SQL Server 2008 to become familiar with an unfamiliar database, or to look for problems in existing data. Take a quick look at all eight of the profiles that the task can compute.

Video Transcript

Hello. My name is Douglas Laudenschlager, and I work on the documentation team for Microsoft SQL Server Integration Services.

Today, you're going to learn how you can familiarize yourself with an unfamiliar database, or look for problems in existing data, by using the Data Profiling Task, which is new in SQL Server 2008 Integration Services.

You will learn how to:

  • Configure and run the Data Profiling task in an Integration Services package.

  • Run the stand-alone Data Profile Viewer to view the output of the task.

  • And, understand and analyze the output of the task that you see in the Data Profile Viewer.

Here we are in Business Intelligence Development Studio, where we've already created a new Integration Services project and opened the new package in the designer. We'll locate the Data Profiling Task among the Control Flow items in the Toolbox, and drag it onto the design surface.

Now, before we can configure the task itself, we need to set up the two connection managers that the task expects for its input and its output. The Data Profiling task takes its input from an ADO.NET connection, so we will create a new ADO.NET connection manager that points to the AdventureWorks sample database. The Data Profiling task uses only ADO.NET connections, and in this release, it profiles only Microsoft SQL Server data sources. The Data Profiling task sends its output to a file in XML format, so we also need a File connection manager. Here we will configure our File connection manager to overwrite an existing output file from a previous run. And now we're prepared to configure the Data Profiling task itself.
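
As a point of reference, an ADO.NET connection manager that uses the SqlClient provider stores a connection string along these lines; the server name and security settings below are only placeholders for whatever your environment uses:

    Data Source=(local);Initial Catalog=AdventureWorks;Integrated Security=True;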

When we open the editor for the Data Profiling task, our first task is to assign the File connection manager that we just created as the destination for the task's output. Because we want to overwrite an existing file, we also change the value of the OverwriteDestination property from False to True.

If we look on the Profile Requests page, we see that there are eight different profile types that the task can compute. For many of these, there are numerous options that you can specify. However, all these options will make more sense to you if we look first at sample output from the Data Profiling task. So, let's take a shortcut.

We'll come back here to the General tab and pick the Quick Profile option. Quick Profile runs up to seven of the available profiles with default options against a single table of our choosing. For our demonstration, we'll select the Person.Address table in the AdventureWorks database, and select all seven of the profiles that we can run.

Now, the task is configured and ready to go. Because it takes a minute to run, we're going to take another shortcut and look at the output from a previous run. For that purpose we need to use the Data Profile Viewer, which is a separate, stand-alone application. Here I've just opened our output file in the Data Profile Viewer. Let's expand the tree here in the Profiles pane on the left-hand side to see the structure of our data and the profiles that have been computed.

Before we take a closer look at the output itself, let's talk for a moment about the panes that you see here in the Data Profile Viewer window. On the left, in the Profiles pane, you see the structure of your data and the profiles that you selected to run. At the top right, in the Results pane, you see a summary of the results of the profile, usually just a single line. In the center right, in the Details pane, you see rich detail about your data that you can easily sort, presented in both text and graphical format. And, if the data source against which you ran your profile is still available, you can view the details of each set of data in the drill-down pane at the bottom right.

Now, let's look at the profiles that are available to us. The Data Profiling task can compute five profiles that examine individual columns, and an additional three profiles that look at the relationships between columns.

For a single column, the simplest profile of all is the Column Null Ratio profile, which computes the percentage of nulls in a given column, and can help you to identify null values where there should not be any. Of course, you would not want null values in a Postal Code column, and this profile result tells us that, in the Address table in AdventureWorks, there are in fact no null values.
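
If you want to spot-check a result like this outside the task, a simple query gives roughly the same answer. The following T-SQL is only an illustrative approximation of what the Column Null Ratio profile reports, not the task's internal logic; it assumes the AdventureWorks Person.Address table shown in the video.

    -- Approximate the Column Null Ratio profile for Person.Address.PostalCode
    SELECT
        COUNT(*) AS TotalRows,
        SUM(CASE WHEN PostalCode IS NULL THEN 1 ELSE 0 END) AS NullRows,
        100.0 * SUM(CASE WHEN PostalCode IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS NullPercent
    FROM Person.Address;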

You can also compute a Column Length Distribution Profile, which shows you the shortest and longest length of strings in a column. This profile can help you to identify unacceptable string values that are shorter or longer than the column requires. Here we see that the non-null entries for AddressLine2 in AdventureWorks range from 1 character to 28 characters in length. The most common length is 5 characters, and if we drill down, we see that these are typically apartment numbers.
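
A comparable spot check in T-SQL, again only a sketch of what the profile reports rather than how the task computes it:

    -- Approximate the Column Length Distribution profile for AddressLine2
    SELECT
        LEN(AddressLine2) AS ValueLength,
        COUNT(*) AS RowsWithThisLength
    FROM Person.Address
    WHERE AddressLine2 IS NOT NULL
    GROUP BY LEN(AddressLine2)
    ORDER BY ValueLength;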

We can also compute a Column Value Distribution Profile, which tells us, for example, that in the AddressLine2 column in AdventureWorks, there are 195 distinct values. This can help to alert us to values that are incorrect or out of range, for example, if we found more than 50 distinct values in a lookup table of the 50 US states. So, if this column contains 195 distinct values, why do we see only one listed here in the Details pane? That's because the default settings for this profile return details only for values that represent more than one tenth of 1% of the data. This is one of the many options that you can set when you configure the task.
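
Here is one way to approximate both parts of this profile in T-SQL; the 0.1% cutoff in the second query simply mimics the default threshold described above and is not the task's own code:

    -- Approximate the Column Value Distribution profile for AddressLine2
    SELECT COUNT(DISTINCT AddressLine2) AS DistinctValues
    FROM Person.Address;

    -- List only the values that account for more than 0.1% of all rows
    SELECT AddressLine2, COUNT(*) AS RowsWithThisValue
    FROM Person.Address
    GROUP BY AddressLine2
    HAVING COUNT(*) > 0.001 * (SELECT COUNT(*) FROM Person.Address)
    ORDER BY RowsWithThisValue DESC;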

The Column Statistics Profile, which is computed for numeric and date columns, shows us the minimum value, maximum value, mean, and standard deviation for a numeric column. When you look at column statistics for a date column, you see the earliest and latest date in the range of dates. Again, this can help to alert you to numbers or dates that are out of range.
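
The same figures are easy to approximate with ordinary aggregate functions. The columns used below (ModifiedDate and StateProvinceID in Person.Address) are simply convenient examples from AdventureWorks, not necessarily the ones shown in the video:

    -- Date column: earliest and latest values
    SELECT MIN(ModifiedDate) AS EarliestDate, MAX(ModifiedDate) AS LatestDate
    FROM Person.Address;

    -- Numeric column: minimum, maximum, mean, and standard deviation
    SELECT
        MIN(StateProvinceID) AS MinValue,
        MAX(StateProvinceID) AS MaxValue,
        AVG(CAST(StateProvinceID AS float)) AS MeanValue,
        STDEV(StateProvinceID) AS StdDevValue
    FROM Person.Address;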

The final profile that can be computed for an individual column is the Column Pattern Profile. This is a more unusual profile that returns a set of regular expressions that cover all the values in the column. You could take these regular expressions and use them in a custom application, either to validate existing data, or to validate user input before data enters the database.
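
Transact-SQL has no built-in regular expression support, so a pattern from this profile would normally be applied in client code. A very simple pattern can still be approximated with LIKE, as in this sketch, which assumes a five-digit postal code as the expected pattern (AdventureWorks also contains non-US postal codes, so treat this purely as an illustration):

    -- Find PostalCode values that do not match a simple five-digit pattern
    SELECT AddressID, PostalCode
    FROM Person.Address
    WHERE PostalCode NOT LIKE '[0-9][0-9][0-9][0-9][0-9]';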

Now, let's look at the profiles that analyze the relationships between columns.

The Candidate Key Profile identifies columns with a high degree of uniqueness, which makes them candidates to become a primary key. If there were violations of uniqueness, we would also see those here. Of course, there are no violations here, since the uniqueness of this key is already enforced by a constraint.
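
To verify a result like this yourself, you can compare total and distinct counts, and list any duplicates; this is a rough equivalent, not what the task runs internally:

    -- How unique is AddressID in Person.Address?
    SELECT COUNT(*) AS TotalRows, COUNT(DISTINCT AddressID) AS DistinctValues
    FROM Person.Address;

    -- Any duplicate values would be key violations
    SELECT AddressID, COUNT(*) AS DuplicateCount
    FROM Person.Address
    GROUP BY AddressID
    HAVING COUNT(*) > 1;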

We don't see the Value Inclusion Profile here in our sample output because it's not run by the Quick Profile option, but it looks at foreign key relationships in a way that's similar to what the Candidate Key Profile does for primary key relationships.
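
In T-SQL terms, a value inclusion check asks whether every value in one column also appears in a reference column in another table. The query below sketches that idea for StateProvinceID in Person.Address against the Person.StateProvince table in AdventureWorks; the profile itself would perform this comparison for you:

    -- Find StateProvinceID values in Person.Address with no match
    -- in Person.StateProvince (inclusion violations)
    SELECT DISTINCT a.StateProvinceID
    FROM Person.Address AS a
    LEFT JOIN Person.StateProvince AS sp
        ON a.StateProvinceID = sp.StateProvinceID
    WHERE sp.StateProvinceID IS NULL;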

The Functional Dependency Profile is one of the more powerful profiles. Let's take a look at an example of a functional dependency. For a given Postal Code, you would always expect the State or Province to be predictable and unchanging. In our output, however, we see that this is only 99% true in AdventureWorks, telling us right away that we have some invalid values for StateProvinceID. If we look at one of the postal codes that has violations, we can drill down to see the rows that have the correct value, and the rows that have an incorrect value.
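
A rough way to find the same violations with a query, again only an approximation of what the profile reports:

    -- Postal codes that map to more than one StateProvinceID
    -- violate the expected functional dependency
    SELECT PostalCode, COUNT(DISTINCT StateProvinceID) AS DistinctStateProvinces
    FROM Person.Address
    GROUP BY PostalCode
    HAVING COUNT(DISTINCT StateProvinceID) > 1;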

In this video, you learned how you can familiarize yourself with an unfamiliar database, or look for problems in existing data, by using the Data Profiling Task, which is new in SQL Server 2008 Integration Services.

You also learned how to:

  • Configure and run the Data Profiling task in an Integration Services package.

  • Run the stand-alone Data Profile Viewer to view the output of the task.

  • And, understand and analyze the output of the task that you see in the Data Profile Viewer.

We hope that you've learned new things and useful skills from this video. After you close this video and return to the Web page, you'll find some other Integration Services videos that are available for you to watch. Thank you.