Extracting Data from Operational Systems
Data that will be used in a data warehouse must be extracted from the operational systems that contain the source data. Data is initially extracted during the data warehouse creation, and ongoing periodic extractions occur during updates of the data warehouse. Data extraction can be a simple operation, if the source data resides in a single relational database, or a very complex operation, if the source data resides in multiple heterogeneous operational systems. The goal of the data extraction process is to bring all source data into a common, consistent format so it can be made ready for loading into the data warehouse.
It is better if data in the source operational systems does not contain validation errors. For example, purchase records for which there are no corresponding customer records to identify the purchasers are clearly errors in the source data, and you should correct them in the source operational system before the data is extracted for loading into the data warehouse. You may be able to implement error checking in the source operational system so such errors can be detected before extracting data for the data warehouse. If such errors are frequent, you may need to have the operational system examined and modified to reduce such errors because such errors may affect the organization's business as well as its data warehouse.
You may not be able to identify validation errors until the data has been extracted from the operational systems. This situation can occur when data is extracted from multiple data sources. For example, reconciling data extracted from separate sales tracking, shipping, and billing systems may uncover discrepancies that must be addressed in one or more of the source systems.
You may also identify inconsistencies other than errors in data after it has been extracted. For example, different data sources may use different coding systems for the same kind of data. You can often use translation tables to reconcile these differences during the extraction operation or later during transformation operations. For example, a legacy system may code state or province names using a three-character code, whereas another system may use a two-character code. The data from one or both of these systems must be translated into a single set of codes before loading the data into the data warehouse.
In other cases, you may discover inconsistencies if source systems permit free-form entry of text information. Such data is often internally inconsistent because different data-entry personnel may enter the same data in different ways. Inconsistent representations of the same data must be reconciled if such data is to be used for analysis. For example, in a data source that permits free-form text entry for the state or province portion of an address, the state of Florida may be entered as FL, Fla, Florida, or even Flor. It may be difficult to modify legacy source systems to implement a standard coding validation. Manual transformation adjustments may be necessary to reconcile such differences if the contributing source systems cannot be modified.
You can use the powerful transformation capabilities of Data Transformation Services (DTS) in Microsoft® SQL Server™ 2000 during the extraction process to reconcile many formatting, data encoding, and other inconsistencies. Other transformations must be accomplished after the data has been extracted from the source systems.
Some of the tools available in SQL Server 2000 for extracting data are:
- Distributed queries
- Command line applications
- bcp utility
- BULK INSERT statement for loading from text files
- ActiveX scripts
In some data warehouse implementations, you may also find that you can use Replication to copy data from source systems to the data preparation area.