SQL Essentials: SQL Server Data Importing: Do's and Don'ts

Ron Talmage

One of the first relational DBMSs I worked with was Microrim's R:Base 4000. R:Base, unlike its PC competitor dBase, was truly relational and had been developed in the early 1980s as a PC version of the NASA RIM (Relational Information Management) system. At any rate, one of the features I most appreciated was the way it let you view sample data during the import process. Although tools like bcp, DTS, the new SQL Server 2005 Integration Services, and various migration tools and wizards have automated the process of importing data into SQL Server, that doesn't mean we should leave our brains on the table. This month, Ron Talmage provides some great common-sense advice about importing data.

SQL Server DBAs often find themselves importing and massaging data using T-SQL. Why? Because some transformations simply require the power of full-fledged SQL. Having recently survived yet another data import, I was inspired to assemble a list of do's and don'ts that work for me.

Do load raw data into staging with varchar data types
Raw data from so-called legacy systems typically arrives in text format, so I invariably start by loading it into a separate staging database. I never load raw data directly into a production database.

	What I do is load all the raw text data into corresponding raw tables whose columns are all varchar. (DTS will do this automatically, and that's a good thing. However, DTS will also name the columns COL001 and so forth if you don't supply column names ahead of time.) The main benefit of varchar is that it lets everything in, even the "bad" data. Legacy systems often had weak checks on user-entered data, so unless your load accepts just about any value, you risk having more rows skipped or written to exception files than actually loaded. A character load into varchar columns will accept just about anything.
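
To make this concrete, here's a minimal sketch of the kind of raw table and load I mean. The table name, column names, file path, and delimiters are all placeholders for whatever your feed actually looks like:

CREATE TABLE RawCustomer (
    CustomerID   varchar(255) NULL,
    CustomerName varchar(255) NULL,
    BirthDate    varchar(255) NULL,
    Balance      varchar(255) NULL
)

-- Everything arrives as text, so even "bad" dates and amounts get in.
BULK INSERT RawCustomer
    FROM 'C:\import\customer.txt'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n')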

Don't use non-alphanumeric characters in staging table/column names
You may not have control over how columns are initially named in the tables containing raw data, but I try to revise legacy column names that may contain spaces or other irregular characters. When column or table names contain non-alphanumeric characters, we're forced to delimit them with square brackets or double quotes. Not only is that kind of code harder to write, it's also harder to read.
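
For example, if a raw table arrives with a column named Customer Name (a hypothetical name), I rename it once in staging rather than bracket it in every query from then on:

-- Until it's renamed, every reference needs delimiters:
SELECT [Customer Name] FROM RawCustomer

-- One rename in staging and the brackets go away:
EXEC sp_rename 'RawCustomer.[Customer Name]', 'CustomerName', 'COLUMN'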

Don't use keywords in column names
Often, data from legacy systems has descriptive column names that can wreak havoc with your SQL queries. For example, real-estate data might have a column called KEY, reflecting the key box placed on houses for sale. However, KEY is also a reserved keyword (!) in T-SQL, and down the road, queries that reference that column name directly will fail. You'll end up having to delimit keyword column names with square brackets or double quotes.
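
Using the hypothetical real-estate example, you can live with the delimiters, but a one-time rename in staging is kinder to everyone downstream:

-- SELECT KEY FROM Listing fails outright, because KEY is reserved.
SELECT [KEY] FROM Listing   -- works, but every query pays the bracket tax

EXEC sp_rename 'Listing.[KEY]', 'KeyBoxNumber', 'COLUMN'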

Do create a staging table with correct data types
The next step is to create one or more additional staging tables that have the "correct" data types. I like to give staging tables the same column names as the destination tables in the target OLTP database. In any case, what's important is that the data type of each column is checked and corrected as the raw data is loaded into the staging tables. It's a lot easier to find bad data in a SQL Server table than in an external file whose rows failed to load.
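
Here's a sketch of that step against the hypothetical raw table from earlier. ISDATE and ISNUMERIC are handy for flagging the rows that won't convert before you attempt the typed load:

-- The typed staging table mirrors the destination's names and types.
CREATE TABLE StageCustomer (
    CustomerID   int         NULL,
    CustomerName varchar(50) NULL,
    BirthDate    datetime    NULL,
    Balance      money       NULL
)

-- First, find the rows that won't convert cleanly.
-- (NULLs show up here too; decide separately whether they're legal.)
SELECT *
FROM RawCustomer
WHERE ISNUMERIC(CustomerID) = 0 OR ISDATE(BirthDate) = 0 OR ISNUMERIC(Balance) = 0

-- Then load only the rows that pass:
INSERT INTO StageCustomer (CustomerID, CustomerName, BirthDate, Balance)
SELECT CAST(CustomerID AS int), CustomerName,
       CAST(BirthDate AS datetime), CAST(Balance AS money)
FROM RawCustomer
WHERE ISNUMERIC(CustomerID) = 1 AND ISDATE(BirthDate) = 1 AND ISNUMERIC(Balance) = 1

Bear in mind that ISNUMERIC is permissive (currency symbols and scientific notation pass), so treat it as a first screen, not a guarantee.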

Do add new columns to the staging tables
When the staging data doesn't have the appropriate columns, you can add them and then split or merge the loaded data. For example, an address might be loaded as a simple string into a staging table, even though the destination tables break out street name and number. Then you can add the street name and number columns to the staging table, and parse the legacy address into the two columns. This has the advantage that the original data exists alongside the newly split data, and you can test your scripts just by comparing the columns.
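
As a sketch, assuming the staging table carries the legacy address in a single Address column (all names hypothetical):

-- Add the new columns alongside the original:
ALTER TABLE StageCustomer ADD
    StreetNumber varchar(10) NULL,
    StreetName   varchar(60) NULL

-- Split on the first space; appending a space guards against
-- values with no space at all (CHARINDEX would return 0):
UPDATE StageCustomer
SET StreetNumber = LEFT(Address, CHARINDEX(' ', Address + ' ') - 1),
    StreetName   = LTRIM(SUBSTRING(Address, CHARINDEX(' ', Address + ' ') + 1, 100))
WHERE Address IS NOT NULL

-- Because Address is still there, testing the split is a simple comparison:
SELECT Address, StreetNumber, StreetName FROM StageCustomer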

Do test populating production data with local copies
When you're ready to insert data from your staging tables, test it first by inserting into local copies of the production tables. Sometimes you only need empty tables; other times, they'll have to be populated.
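
The quickest way I know to spin up such copies is SELECT...INTO, with one caveat: it copies only the columns and their nullability, not keys, defaults, or checks, which is exactly why the next do matters. Database and table names here are placeholders:

-- Empty copy of the production table (columns and nullability only):
SELECT * INTO dbo.CustomerCopy
FROM ProdDB.dbo.Customer
WHERE 1 = 0

-- When the test needs existing rows, populate the copy:
INSERT INTO dbo.CustomerCopy
SELECT * FROM ProdDB.dbo.Customer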

Do keep production constraints
Always keep the production constraints on the copied tables. That way, you can test how well your staging data satisfies those constraints: nullability (NULL or NOT NULL), default, check, primary key, and foreign key constraints. Start by making sure your copied tables have the same NULL or NOT NULL settings on their columns as the target system, and then work up through the remaining constraints. If your tests show that the staging insert process satisfies all of them, you're that much closer to success.
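
Because SELECT...INTO drops the constraints, I script the copy from the production DDL instead. Here's a trimmed, hypothetical example of what the copied table should look like, along with the test insert:

CREATE TABLE dbo.CustomerCopy (
    CustomerID   int         NOT NULL
        CONSTRAINT PK_CustomerCopy PRIMARY KEY,
    CustomerName varchar(50) NOT NULL,
    BirthDate    datetime    NULL
        CONSTRAINT CK_CustomerCopy_BirthDate CHECK (BirthDate < GETDATE()),
    Balance      money       NOT NULL
        CONSTRAINT DF_CustomerCopy_Balance DEFAULT 0
)

-- Rows that violate any constraint fail here, not in production:
INSERT INTO dbo.CustomerCopy (CustomerID, CustomerName, BirthDate, Balance)
SELECT CustomerID, CustomerName, BirthDate, Balance
FROM StageCustomer

Foreign keys deserve the same treatment: create and load the parent table copies first so the references can actually be checked.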

Do test on a copy of production data
While inserting imported data into empty tables will catch a lot of potential problems, it can't catch them all. When you've passed all the previous tests, make sure you test your import against a copy (or at least a reasonable subset) of the production data. The errors that remain at this stage stem from the data already in the target system, and that's exactly what this test detects. You can then inspect the results in the copy database, and perhaps even redirect the application to it for further testing and verification. [Columnist Tom Moreau adds, "Test with a daily refresh of production leading up to the date of migration. If the old system has insufficient constraints and the new system does, then bad data will enter the old system and break your migration."–Ed.]
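
As for obtaining the copy, restoring a recent production backup under a new name is the most faithful route. Here's a sketch with placeholder file and database names (run RESTORE FILELISTONLY first to get the logical file names for the MOVE clauses):

RESTORE FILELISTONLY FROM DISK = 'D:\backup\Prod.bak'

RESTORE DATABASE ImportTest
FROM DISK = 'D:\backup\Prod.bak'
WITH MOVE 'Prod_Data' TO 'D:\data\ImportTest_Data.mdf',
     MOVE 'Prod_Log'  TO 'D:\data\ImportTest_Log.ldf'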

	If your import process passes all tests up to this point, you may be ready to import your data, or at least to hand off the import process to QA.

To find out more about SQL Server Professional and Pinnacle Publishing, visit their Web site at http://www.pinpub.com/

This article is reproduced from the June 2005 issue of SQL Server Professional. Copyright 2005, by Pinnacle Publishing, Inc., unless otherwise noted. All rights are reserved. SQL Server Professional is an independently produced publication of Pinnacle Publishing, Inc. No part of this article may be used or reproduced in any fashion (except in brief quotations used in critical articles and reviews) without prior consent of Pinnacle Publishing, Inc. To contact Pinnacle Publishing, Inc., please call 1-800-788-1900.