The Microsoft Data Warehousing Strategy
July 11, 1998
The Data Warehousing Process
Microsoft Data Warehousing Framework
An Information Interchange Standard—OLE DB
Metadata: The Glue that Binds the Data Warehouse
Designing the Data Warehouse
Import/Export and Transform Data
Analysis and Presentation of the Data
Making better business decisions quickly is the key to succeeding in today's competitive marketplace. Understandably, organizations seeking to improve their decision-making can be overwhelmed by the sheer volume and complexity of data available from their varied operational and production systems. Making this data available to a wide audience of business users is one of the most significant challenges for today's information technology professionals.
In response, many organizations choose to build a data warehouse to unlock the information in their operational systems and understand real-world business problems. The data warehouse is an integrated store of information collected from other systems and becomes the foundation for decision support and data analysis. While there are many types of data warehouses, based on different design methodologies and philosophical approaches, they all have these common traits:
- Information in a data warehouse is organized around the major subjects of the enterprise (for example, customers, products, sales, or vendors), reflecting a data-driven design.
- Raw data is gathered from nonintegrated operational and legacy applications, cleansed, and then summarized and presented in a way that makes sense to business users.
- Based on feedback from users and discoveries in the data warehouse, the architecture of the data warehouse will change over time—reflecting the iterative nature of the process.
The data warehousing process is inherently complex, and as a result has been historically costly and time-consuming. Over the past several years, Microsoft has worked with the industry to create a platform for data warehousing, consisting of both component technology and leading products, that can be used to lower the costs and improve the effectiveness of data warehouse creation, administration, and usage. Microsoft has also been developing a number of products and facilities such as Microsoft® SQL Server™ version 7.0 that are well suited to the data warehousing process. Coupled with third-party products that can be integrated using the Microsoft Data Warehousing Framework, customers have a large selection of interoperable, best-of-breed products from which to choose for their data warehousing needs.
Microsoft SQL Server 7.0 offers a great breadth of functionality to support the data warehousing process. In conjunction with the Data Warehousing Framework, Microsoft is delivering a platform for data warehousing that reduces the costs, simplifies the complexity, and improves the effectiveness of data warehousing efforts.
The data warehousing process is iterative in nature, and requires constant change over the life span of the application.
From the information technology perspective, data warehousing is aimed at the timely delivery of the right information to the right individuals in an organization. This is invariably an ongoing process, not a one-time solution, and requires a different approach from that required in the development of transaction-oriented systems.
A data warehouse is a collection of data in support of management's decision-making process that is subject-oriented; integrated; time-variant; and nonvolatile (see W.H. Inmon, Building the Data Warehouse, 1992). Simply put, this means that the data warehouse is focused on a business concept (for example, sales) rather than a business process (for example, issuing invoices), and contains all the relevant information on the concept gathered from multiple processing systems. This information is collected and represented at consistent periods of time and is not changing rapidly.
The data warehouse integrates operational data through consistent naming conventions, measurements, physical attributes, and semantics. The first step toward building the data warehouse is a management process: determining which subject areas should be included and developing a set of agreed-upon definitions. This requires interviews with end users, business analysts, and executives to understand and document the scope of the information requirements. Only after a thorough understanding of the business issues can the logical process be translated into a physical data warehouse.
Following the physical design, systems are put in place to populate the data warehouse from operational systems on an ongoing basis. Because the operational systems and the data warehouse have different representations of the data, populating the data warehouse requires transformations of the data: summarization, translation, decoding, elimination of invalid data, and so on. These processes need to be automated so that they can be performed on an ongoing basis: extracting, transforming, and moving the source data as often as needed to meet the business requirements of the data warehouse.
In the operational environment, data is current valued and accurate as of the moment of access. For example, an order entry application always shows the current value of inventory on-hand for each product. If one were to inquire about the quantity of a given product in inventory, the results could differ between two queries issued just moments apart. However, data in the warehouse represents information over a long period and is expected to be accurate as of a particular point in time. In effect, the warehouse contains a long series of snapshots about the key subject areas of the business.
Finally, information is made available to the business analysts and executives for browsing, analysis, and reporting. Many tools can be used to assist in the analysis stage, from simple report writers to advanced data mining tools. Ultimately, however, the analysis effort drives the final iterations of the data warehousing process: revisions in the design of the data warehouse in order to accommodate new information, improve system performance, or allow new types of analysis. With these changes the process begins again, and continues through the life of the data warehouse.
Many methodologies have been proposed for the design of data warehouses to simplify the information technology efforts required to support the process on an ongoing basis. This has led to many debates over the best architecture for delivering data warehouses in organizations.
Data Warehouse Architectures
Notwithstanding the many approaches to implementing decision support systems in practice, there are two basic types of data warehouses: enterprise data warehouses and data marts. Each have their proponents, as well as their respective strengths and weaknesses.
The enterprise data warehouse contains corporate-wide information integrated from multiple operational data sources for consolidated data analysis. Typically it is composed of several subject areas such as customers, products, and sales and is used for both tactical and strategic decision making. An enterprise warehouse contains both detailed point-in-time data and summarized information and can range from 50 gigabytes to more than one terabyte in total data size. Enterprise data warehouses can be very expensive and time-consuming to build and manage. They are usually created by centralized information services (IS) organizations from the top down.
Data marts contain a subset of corporate-wide data that is built for use by an individual department or division of an organization. Unlike the enterprise warehouse, data marts are often built from the bottom up by departmental resources for a specific decision support application or group of users. Data marts contain summarized and often detailed data about the subject area. This information in a data mart can be a subset of an enterprise warehouse (dependent data mart) or more likely come directly from the operational data sources (independent data mart).
Regardless of the architecture, data warehouses and data marts are constructed and maintained through the iterative process described earlier. Furthermore, both approaches share a similar set of technological requirements.
Data Warehousing Components
A data warehouse always consists of a number of components, which can include:
- Operational data sources.
- Design/development tools.
- Data extraction and transformation tools.
- A database management system.
- Data access and analysis tools.
- System management tools.
Microsoft recognizes that not all of these components will be Microsoft products in a customer's implementation. In fact, more often than not, a data warehouse will be constructed using a wide variety of tools from a number of vendors, coupled with custom programming.
Several years ago, Microsoft recognized the crucial need for a set of integrating technologies that allows these many vendors' products to work together easily. This recognition led to the creation of the Microsoft Data Warehousing Framework, a roadmap not only for the development of Microsoft products such as SQL Server 7.0, but also the technologies necessary to integrate products from many other vendors, including both Microsoft business partners and competitors.
The goal of the Data Warehousing Framework is to simplify the design, implementation, and management of data warehousing solutions. This framework has been designed to provide:
- An open architecture that is integrated easily with and extended by third-party vendors.
- Heterogeneous data import, export, validation, and cleansing services with optional data lineage.
- Integrated metadata for warehouse design, data extraction/transformation, server management, and end-user analysis tools.
- Core management services for scheduling, storage management, performance monitoring, alerts/events, and notification.
The Microsoft Data Warehousing Framework is the roadmap for product development and integration on the Microsoft platform.
Figure 2. The Microsoft Data Warehousing Framework
The Data Warehousing Framework has been designed from the ground up to provide an open architecture that can be extended easily by Microsoft customers and business partners using industry-standard technology. This allows organizations to choose best-of-breed components and still be assured of integration.
Ease-of-use is a compelling reason for customers and independent software vendors to choose the Microsoft Data Warehousing Framework. Microsoft provides an object-oriented set of components that are designed to manage information in the distributed environment. Microsoft is also providing both entry-level and best-of-breed products to address the many steps in the data warehousing process.
Data Warehousing Framework Components
The Data Warehousing Framework describes the relationships between the various components used in the process of building, using, and managing a data warehouse. The core of the Microsoft Data Warehousing Framework is the set of enabling technologies comprised of the data transport layer (OLE DB) and the integrated metadata repository. These two technologies allow for the interoperability of the many products and tools involved in the data warehousing process.
Building the data warehouse requires a set of tools for describing the logical and physical design of the data sources and their destinations in the data warehouse or data marts. Operational data must pass through a cleansing and transformation stage before being placed into the data warehouse or data marts in order to conform to the definitions laid out during the design stage. This data staging process is often many levels deep, especially with enterprise data warehouse architectures, but is necessarily simplified in Figure 2.
End user tools, including desktop productivity products, specialized analysis products and custom programs are used to gain access to the information in the data warehouse. Ideally, user access is through a directory facility that enables user search for appropriate and relevant data to resolve business questions, and provides a layer of security between the users and the back-end systems.
Finally, a variety of tools can come into play for the management of the data warehouse environment, such as scheduling repeated tasks and managing multiserver networks.
Microsoft Repository provides the integration point for the metadata (that is, data about the data) shared by the various tools used in the data warehousing process. Shared metadata allows for the transparent integration of multiple tools from a variety of vendors, without the need for specialized interfaces between each of the products. The Microsoft Repository architecture is described in detail elsewhere in this document.
Similarly, OLE DB provides for standardized, high-performance access to a wide variety of data and allows for integration of multiple data types. OLE DB is described in detail in the following section.
[Editor's note: This section, ending at the last sentence before the section "Metadata: The Glue that Binds the Data Warehouse," is excerpted from "Microsoft Strategy for Universal Data Access," Microsoft Corporation, October 1997.]
Accessing the variety of possible data sources requires easy connectivity and interoperability with heterogeneous databases and presents one of the most significant technical problems in data warehouse implementation. The Data Warehousing Framework relies on Universal Data Access, Microsoft's data transport standard, and on the OLE DB interfaces. Universal Data Access is a platform, application, and tools initiative that defines and delivers both standards and technologies and is a key element in Microsoft's foundation for application development, the Microsoft Windows® Distributed interNet Applications (DNA) architecture.
Universal Data Access provides high-performance access to a variety of data and information sources on multiple platforms and an easy-to-use programming interface that works with practically any tool or language, leveraging the technical skills developers already have. The technologies that support Universal Data Access enable organizations to create easy-to-maintain solutions and use their choice of best-of-breed tools, applications, and data sources on the client, middle-tier, or server.
A Unified Data Access Model Based on COM
One strength of Microsoft's Universal Data Access strategy is that it is delivered through a common set of modern, object-oriented interfaces. These interfaces are based on Microsoft's Component Object Model (COM), the most widely implemented object technology in the world. COM has become the choice of developers worldwide because it provides:
- The richest integrated services, including transactions, security, message queuing, and data access, to support the broadest range of application scenarios.
- The widest choice of tools from multiple vendors using multiple development languages.
- The largest customer base for customizable applications and reusable components.
- Proven interoperability with users' and developers' existing investments.
Because of the consistency and interoperability afforded through COM, Microsoft's Universal Data Access architecture is open and works with virtually any tool or programming language. It also enables Universal Data Access to provide a consistent data access model at all tiers of the modern application architecture.
Microsoft Universal Data Access exposes COM-based interfaces optimized for both low-level and high-level application development: OLE DB and ActiveX® Data Objects (ADO).
Definition of OLE DB
OLE DB is Microsoft's strategic system-level programming interface to manage data across the organization. OLE DB is an open specification designed to build on the success of Open Database Connectivity (ODBC) by providing an open standard for accessing all types of data. While ODBC was created to access relational databases, OLE DB is designed for accessing relational and nonrelational information sources, including:
- Mainframe ISAM/VSAM and hierarchical databases.
- E-mail and file system stores.
- Text, graphical and geographical data.
- Custom business objects.
- And more.
OLE DB defines a collection of COM interfaces that encapsulates various database management system services. These interfaces enable the creation of software components that implement such services. OLE DB components consist of data providers (that contain and expose data), data consumers (that use data), and service components (that process and transport data, for example, query processors and cursor engines).
OLE DB interfaces are designed to help components integrate smoothly so that OLE DB component vendors can bring high-quality OLE DB components to the market quickly. In addition, OLE DB includes a bridge to ODBC to enable continued support for the broad range of ODBC relational database drivers available today.
Definition of ActiveX Data Objects
ActiveX Data Objects (ADO) is Microsoft's strategic, application-level programming interface to data and information. ADO provides consistent, high-performance access to data and supports a variety of development needs, including the creation of front-end database clients and middle-tier business objects, using applications, tools, languages, or Internet browsers. ADO is designed to be a data interface for one-to-multitier, client/server, and Web-based solution development.
Figure 3. Universal Data Access architecture
ADO provides an easy-to-use application-level interface to OLE DB, which provides the underlying access to data. ADO is implemented with a small footprint, minimal network traffic in key scenarios, and a minimal number of layers between the front end and data source, which provides a lightweight, high-performance interface. ADO is easy to use because it is called using the COM automation interface: a familiar metaphor that is available from all leading RAD, database tools, and languages on the market today. And because ADO was designed to combine the best features of, and eventually replace, Remote Data Objects (RDO) and Data Access Object (DAO), it uses similar conventions with simplified semantics to make it a natural next step for today's developers.
One of the biggest implementation challenges today is integrating all of the different tools required to design, transform, store, and manage a data warehouse. The ability to share and reuse metadata, that is, data about the data, dramatically reduces the cost and complexity of building, using and managing data warehouses. Many data warehousing products include a proprietary metadata repository that cannot be used by any other components in the warehouse. Each tool must be able to access, create, or enhance the metadata created by any other tool easily, while also extending the metadata model to meet the specific needs of the tool.
Consider the example of a data warehouse built with the presence of shared metadata. Metadata from operational systems is stored in the repository by design and data transformation tools. This physical and logical model is used by transformation products to extract, validate, and cleanse the data prior to loading it into the database. The database management system used may be relational, multidimensional, or a combination of both. The data access and analysis tools provide access to the information in the data warehouse. The information directory integrates the technical and business metadata to make it easy to find and launch existing queries, reports, and applications for the warehouse.
The Data Warehousing Framework is centered upon shared metadata in Microsoft Repository, which is provided as a component of Microsoft SQL Server 7.0. Microsoft Repository is a database that stores descriptive information about software components and their relationships. It consists of an Open Information Model (OIM) and a set of published COM interfaces. The Open Information Models are object models for specific types of information, and are flexible enough to support new information types, as well as extensible to fit the needs of specific users or vendors. Microsoft has already developed OIMs in collaboration with the industry for Database Schema, Data Transformations, and Online Analytical Processing (OLAP). Future models will include replication, task scheduling, semantic models and an information directory that combines business and technical metadata.
The Metadata Coalition, an industry consortium of 53 vendors dedicated to fostering a standard means for vendors to exchange metadata, has announced support for Microsoft Repository, and the Repository OIMs have received broad third-party independent software vendor (ISV) support.
The development phase of the data warehousing process often begins with the creation of a dimensional business model that describes the important metrics and dimensions of the selected subject area based on user requirements. Unlike online transaction processing (OLTP) systems that organize data in a highly normalized fashion, the data in the warehouse is organized in a very denormalized manner to improve query performance when stored in a relational database management system.
Figure 4. An example of a star schema. In this type of database schema, a central "fact" table is linked to related attribute—or dimension—tables.
Relational databases often use star and snowflake schemas to provide the fastest possible response times to complex queries. Star schemas contain a denormalized central fact table for the subject area and multiple dimension tables that contain descriptive information of the subject's dimensions (source: The Data Warehousing Institute, "Star and Snowflake Schema," 1996). The central fact table can contain many millions of rows. Commonly accessed information is often preaggregated and summarized to further improve performance.
While the star schema is primarily considered a tool for the database administrator to increase performance and simplify data warehouse design, it is also a useful convention for representing data warehouse information in a way that makes better sense to business users.
The Data Warehouse "Data Store"
At the heart of the data warehouse is a database, and it is crucial to build these systems on a high-performance engine that will meet both current and future needs of the organization. Relational database management systems are the most common reservoirs of the large volumes of information held in data warehouses. Increasingly, these relational systems are being augmented with multidimensional OLAP servers that provide enhanced navigational capabilities and increased performance for complex queries. Also important are facilities for replicating databases reliably from central warehouses to dependent data marts, or to ensure consistency between mirrored data marts distributed geographically.
Scalable, Robust Relational DBMS
Microsoft SQL Server 7.0 contains a number of features that make it an excellent database platform for data warehouses and data marts, including:
- Support for terabyte-sized databases to manage the largest data warehouses.
- Scalability for very large databases, but also across the enterprise to departmental servers and even laptop computers with the 100 percent code-compatible desktop edition, providing the ultimate flexibility and access to central data sources.
- Advanced query processing to support the optimization and execution of complex queries typical in data warehouse applications, including star joins.
- Intraquery parallelism, which provides faster performance by breaking a complex single query into component parts and distributing the workload to multiple processors, including remotely linked servers.
- High performance utilities for performance tuning, data loading and index construction.
- Heterogeneous JOIN capabilities, which enables retrieval and consolidation of information from any OLE DB source.
The Microsoft SQL Server 7.0 relational database engine is appropriate for nearly every data warehouse size and complexity. However, data warehouse implementations usually require more than just a single, central database. In practice, organizations will implement decision support systems with additional analytical tools, and with distributed information architectures. Microsoft SQL Server 7.0 includes essential facilities for managing these additional tasks.
Integrated OLAP Analytical Capabilities
Online Analytical Processing (OLAP) is an increasingly popular technology that can dramatically improve business analysis. Historically, OLAP has been characterized by expensive tools, difficult implementation and inflexible deployment. Decision Support Services (DSS) is a new, fully featured OLAP capability that is provided as a component of Microsoft SQL Server 7.0. Microsoft DSS includes a middle-tier server that allows users to perform sophisticated analysis on large volumes of data with exceptional results. A second component of Microsoft DSS is a client-side cache and calculation engine called PivotTable® Service, which helps improve performance and reduce network traffic. PivotTable Service also enables users to conduct analyses while disconnected from the corporate network.
Figure 5. Microsoft DSS is a middle-tier OLAP server that simplifies user navigation and improves performance for queries against information in the data warehouse.
OLAP is a key component of data warehousing, and Microsoft DSS provides essential functionality for a wide array of applications ranging from corporate reporting to advanced decision support. The inclusion of OLAP functionality within SQL Server will make multidimensional analysis much more affordable, and will bring the benefits of OLAP to a wider audience. This includes not only smaller organizations, but also groups and individuals within larger corporations who have been excluded previously from the OLAP industry by the cost and complexity of today's products.
Coupled with the wide variety of tools and applications supporting OLAP applications through the Microsoft OLE DB for OLAP interface, Microsoft DSS will help to increase the number of organizations who have access to sophisticated analytical tools, and will reduce the costs of data warehousing.
Creating distributed dependent data marts from a central data warehouse, or even reliably duplicating the contents of an independent data mart, requires the ability to replicate information reliably. Microsoft SQL Server 7.0 includes facilities to distribute information from a central, publishing data warehouse to multiple subscribing data marts. Information can be partitioned by time, geography, and so forth as part of the replication process.
SQL Server provides a variety of replication technologies that can be tailored to your application's specific requirements. Each technology produces different benefits and restrictions across these three important dimensions:
- Transactional consistency
- Site autonomy
- Partitioning data to avoid conflicts
Requirements along and across these three dimensions will vary from one distributed application to the next.
In most decision-support applications, data will not be updated at individual sites; instead, information will be prepared at a central staging area and "pushed" out to distributed database servers for remote access. For this reason, snapshot replication is often used to distribute data.
As its name implies, snapshot replication takes a picture, or snapshot, of the published data in the database at one moment in time. Instead of copying INSERT, UPDATE, and DELETE statements (characteristic of transactional replication), or data modifications (characteristic of merge replication), Subscribers are updated by a total refresh of the data set. Hence, snapshot replication sends all the data to the Subscriber instead of sending just the changes. If the information being sent is very large, it can require substantial network resources to transmit. In deciding if snapshot replication is appropriate, one must balance the size of the entire data set against the volatility of change to the data.
Snapshot replication is the simplest type of replication, and it guarantees latent guaranteed consistency between the Publisher and Subscriber. It also provides high autonomy if Subscribers do not update the data. This type of replication is a good solution for read-only Subscribers that do not require the most recent data, and can be totally disconnected from the network when updates are not occurring. However, SQL Server provides a full range of choices for replication depending on application requirements. See "Replication for Microsoft SQL Server Version 7.0" for more information on replication capabilities in SQL Server 7.0.
Before the data can be loaded into the warehouse, it must first be transformed into an integrated, consistent format. A transformation is the sequence of procedural operations that are applied to the information in a data source before it can be stored in the specified destination. Data Transformation Services (DTS) is a new facility in Microsoft SQL Server 7.0 that supports many types of transformations, such as simple column mappings, calculation of new values from one or more source fields, decomposition of a single field into multiple destination columns, and many more.
Data Transformation Services Goals
Data Transformation Services was created to:
- Provide better importing, exporting, and transformation of heterogeneous data using OLE DB.
- Provide an extensible architecture accessible to ISVs, customers and consultants.
- Share rich metadata about the sources, destinations, transformations and lineage through integration with the Microsoft Repository Open Information Models.
Data Transformation Services allows the user to import, export and transform data to and from multiple data sources using 100 percent OLE DB-based architecture. OLE DB data sources include not only database systems, but also desktop applications such as Microsoft Excel. Microsoft provides native OLE DB interfaces for SQL Server and for Oracle. In addition, Microsoft has developed an OLE DB wrapper that works in conjunction with existing ODBC drivers to enable access to other relational sources. Delimited and fixed-field text files are also supported natively.
DTS transformation definitions are stored in Microsoft Repository, SQL Server, or COM structured storage files. Operational data sources are accessed using OLE DB, providing access to both relational and nonrelational data sources. The data pump opens a rowset from the source and pulls each row from the source into the data pump. The data pump executes ActiveX Script (Visual Basic®, Scripting Edition, Jscript® and PerlScript) functions to copy, validate, or transform data from the source to the destination. Custom transform objects can be created for advanced data scrubbing. The new values for the destination are returned to the pump and sent to the destination via high-speed data transfers. Destinations can be OLE DB, ODBC, ASCII fixed field, ASCII delimited files, and HTML.
Figure 6. DTS Architecture. Data is pulled from source systems with an OLE DB data pump and optionally transformed before being sent to OLE DB destinations.
Complex transformation and data validation logic can be implemented using an ActiveX scripting engine. These scripts can invoke methods from any OLE object to modify or validate the value of a column. Advanced developers can create reusable COM transformation objects that provide advanced scrubbing capabilities. Custom tasks can be created that transfer files by FTP or launch external processes.
ISVs and consultants can create new data sources and destinations by providing OLE DB interfaces. The data pump will query the OLE DB interface for any provider to determine whether high-speed data loading is support; if not implemented, standard loading mechanisms will be used.
Although standards like ANSI SQL-92 have improved interoperability between relational database engines, vendors still differentiate themselves in the marketplace by adding useful but proprietary extensions to ANSI SQL. SQL Server offers a simple programming language known as Transact-SQL that provides basic conditional processing and simple control of repetition. Oracle, Informix, and other vendors all offer similar, but incompatible SQL extensions. The DTS Transformation Engine's pass-through SQL architecture guarantees that 100 percent of the functionality of the source and destination are available to customers using the transformation engine. This will allow customers to leverage scripts and stored procedures that they have already developed and tested by simply invoking them from the transformation engine. The pass-through architecture dramatically simplifies development and testing, since DTS does not modify or interpret the SQL statement being executed. Any statement that works through the native interface of the DBMS will work exactly the same way during a transformation.
Data Transformation Services will record and document the lineage of each transformation in the repository so customers can know where their data came from. Data lineage can be tracked at both the table and row levels in a table. This provides a complete audit trail for the information in the data warehouse. Data lineage is shared across vendor products. The Data Transformation Services packages and data lineage can be stored centrally in Microsoft Repository. This includes transformation definitions, Visual Basic scripts, Java scripts, and package execution history. Integration with the repository will allow third parties to build on the infrastructure provided by the Transformation Engine. Execution of DTS packages can be scheduled for execution through an integrated calendar, executed interactively or in response to system events.
The DTS Package
The Data Transformation Services package is a complete description of all the work to be performed as part of the transformation process. Each package defines one or more tasks to be executed in a coordinated sequence. A DTS package can be created interactively using the graphical user interface or by any language that supports OLE automation. The DTS package can be stored in Microsoft Repository, in SQL Server, or as a COM structured storage file. Once retrieved from the repository or structured storage file, the package can be executed in the same way as a DTS package that was created interactively.
Figure 7. DTS packages can contain multiple tasks, and each task can be as simple as a simple table-to-table mapping, or as complex as invoking an external data cleansing process.
A task defines a piece of work to be performed as part of the transformation process, and a DTS package is composed of one or more tasks. A task can move and transform heterogeneous data from an OLE DB source to an OLE DB destination using the DTS Data Pump, execute an ActiveX Script, or launch an external program. Tasks are then executed by step objects.
The step objects coordinate the flow of control and execution of tasks in the DTS package. Some tasks must be executed in a certain sequence. For example, a database must be successfully created (Task A) before a table (Task B) can be created. This is an example of a finish-start relationship between Task A and Task B; Task B, therefore, has a precedence constraint on Task A. Each task is executed when all preceding constraints have been satisfied. Tasks can be executed conditionally based on runtime conditions. Multiple tasks can be executed in parallel to improve performance. For example, a package can load data simultaneously from Oracle and DB2 into separate tables. The step object also controls the priority of task. The priority of a step determines the priority of the Win32® thread running the task.
The DTS Data Pump is an OLE DB Service Provider that provides the infrastructure to import, export, and transform data between heterogeneous data stores. OLE DB is Microsoft's strategic data access interface and provides access to the broadest possible range of relational and nonrelational data stores. The DTS Data Pump is a high-speed, in-process COM server that moves and transforms OLE DB rowsets.
A transformation is the set of procedural operations that must be applied to the source rowset before it can be stored in the desired destination. The DTS Data Pump provides an extensible, COM-based architecture that allows complex data validations and transformations as the data moves from the source to the destination. The Data Pump makes the full power of the ActiveX Scripting languages available to the DTS package. This allows complex procedural logic to be expressed as simple, reusable ActiveX scripts. These scripts can validate, convert, or transform the column values using the scripting language of their choice as they move from the source, through the Data Pump, to the destination. New values can be calculated easily from one or more columns in the source rowset. Source columns also decompose a single field into multiple destination columns. ActiveX Scripts can also invoke and utilize the services of any COM object that supports automation.
Creating DTS Packages
DTS packages can be created using the import/export wizards, using the DTS Package Designer, or programmatically. The import/export wizards provide the simplest mechanism for moving data into or out of a data warehouse, but the transformation complexity is limited by the wizard's scope. For example, only single sources and single destinations are allowed in the wizard. However, the DTS Package Designer exposes all the capabilities of DTS through an easy-to-use, visual interface. Within the Package Designer, users can define precedence relationships, complex queries, flow of control, and access to multiple, heterogeneous sources.
Finally, applications can define and execute DTS packages programming through a COM interface. This approach is primarily used by ISVs who want to utilize the features of DTS without requiring a user to define the packages separately.
Figure 8. The DTS Package Designer provides a graphical environment for describing data flow and package execution.
Microsoft provides a number of mechanisms for querying information in the data warehouse. In the Microsoft Office suite of productivity tools, both Microsoft Access and Microsoft Excel offer facilities for query and analysis of information in a data warehouse. Included with SQL Server 7.0 is a component called English Query that allows users to query the database using straightforward English-language sentences. In addition, through the Data Warehousing Framework, many compatible products are available for sophisticated viewing and analysis of data.
Two of the most common tools used to access and manipulate data for decision support are Microsoft Access and Microsoft Excel. With the introduction of the next version of Microsoft Office, to be called Microsoft Office 2000, users will have many more facilities to analyze and present the information in their data warehouses. Microsoft Excel will be enhanced to enable tabular and graphical representation of OLAP data sources through the OLE DB for OLAP interfaces. At the same time, the existing PivotTable capability will be replaced by a more advanced OLAP facility based on the PivotTable Service component of Microsoft DSS. Microsoft Access is being enhanced to provide transparent support for SQL Server databases in addition to the existing Access database facilities. For customers, these new capabilities will allow their familiar desktop tools to be used for increasingly sophisticated analysis of data.
Microsoft Office 2000 will also include a number of components for simplifying the construction of Web-based applications using prebuilt controls. These controls will provide access to relational databases and OLAP databases, enabling widespread viewing of information in the data warehouse.
English Query is a feature of Microsoft SQL Server 7.0. English Query allows an application builder to create an application to the data warehouse that enables users to retrieve information from a SQL Server database using English rather than a formal query language like SQL. For example, you can ask, "How many widgets were sold in Washington last year?" instead of using SQL statements:
SELECT sum(Orders.Quantity) from Orders, Parts WHERE Orders.State='WA' and Datepart(Orders.Purchase_Date,'Year')='1996' and Parts.PartName='widget' and Orders.Part_ID=Parts.Part_ID
An English Query application accepts English commands, statements, and questions as input and determines their meaning. It then writes and executes a database query in SQL and formats the answer. English Query may also request additional information from a user if a question cannot be interpreted. English Query has a deep knowledge of language syntax and usage, but the application developer must create a domain of information about the data being made available to the user. In Microsoft English Query, a domain is the collection of all information that is known about the objects in the English Query application. This information includes the specified database objects (such as tables, fields, and joins) and semantic objects (such as entities, the relationships between them, and additional dictionary entries), and global domain default options.
The first step to building an English Query application is to model the semantics of the data warehouse. The developer maps the English language entities (nouns) and relationships (verbs, adjectives, traits, and subsets) to tables, fields and joins in the database. This is done using an authoring tool that allows for testing of domain outside the application.
Figure 9. The flow of information between a Web-based English Query application and a SQL Server database.
Once the domain is modeled sufficiently for user testing and access, the developer makes the English Query application accessible through a Visual Basic application or a Web-based implementation using Active Server Pages. With the increase of Intranet-based information delivery into data warehouses, English Query is an excellent tool for enabling access to users without costly query tools and without training.
At run time, an end user of an English Query application connects to a Web page through Microsoft Internet Explorer (or other Web browser), and enters a question. Microsoft Internet Explorer then passes the question to Internet Information Server (IIS), along with the URL of the Active Server Pages (ASP) page that executes VBScript.
The script passes the question to English Query for translation into an SQL statement. English Query uses domain knowledge about the target database (in the form of an English Query application) to parse the question and translate it into SQL. The script then retrieves this SQL, executes it (using an ASP database control), formats the result as HTML, and returns the result page to the user.
English Query includes sample ASP pages that can be used as delivered for rapid prototyping or customized to fit the look and feel of an existing Web application.
A fundamental philosophy of the Microsoft Data Warehousing Framework is the openness of the solution to third-party components. Through the ODBC and OLE DB database interface standards, dozens of products can access and manipulate the information stored in a SQL Server or other relational database. Likewise, the OLE DB for OLAP multidimensional database interface makes available the information in Microsoft Decision Support Services and other OLAP data stores. Because of these two access standards, organizations are able to select the most appropriate analytical tools for their needs. The reduced expenses for software vendors due to standardization will also mean that the costs of acquiring best-in-class products will diminish over time.
One of the most significant hidden costs of implementing a data warehouse is the ongoing maintenance and administration of the system. With conventional technology, specialized skills are typically required to manage the relational database, the OLAP server, and the design and transformation technology. This means multiple individuals with specific training are often needed to perform integral, related tasks. The Microsoft Data Warehousing Framework provides an integrated management and administration layer that can be shared across the components in the data warehousing process. Microsoft is providing a console for the Microsoft product lines that simplifies the transition from task to task, even between separate products. The Microsoft Management Console is extensible by customers, consultants and independent software vendors, providing for a highly customized interface for specific environments. Applications are delivered as a snap-in to the console, and can be either a packaged user interface developed by a software vendor or a customized interface developed separately, but accessing the capabilities of an underlying product such as SQL Server. Like much of the Microsoft BackOffice® suite, SQL Server 7.0 is delivered as a snap-in to the Microsoft Management Console.
Microsoft Management Console
The Microsoft Management Console (MMC) provides a consistent, familiar interface for accessing the capabilities of Microsoft's server products. MMC's user interface is similar to the Windows Explorer environment, with a vertically split work area containing the tree of categories and objects relevant to a particular server on the left side, and the right side providing additional details about a selected item. Detailed information in the right pane can be displayed in a variety of ways, including HTML documents. This new console allows for more sophisticated tools to assist the novice or infrequent database administrator. One significant addition is taskpads, which group together multifaceted activities such as building a database, establishing user security, and monitoring the SQL Server database. Taskpads combine tutorial information, guided activities and the linking of the many wizards described below.
Microsoft SQL Server 7.0 contains more than 25 wizards designed to simplify execution of frequent tasks, including:
- Creating databases, views, indexes, and stored procedures.
- Backing up or restoring a database.
- Configuring a publishing and a distribution server for replication and creating a publication for replication.
- Managing SQL Server database security.
- Creating a maintenance file that can be run on a regular basis.
- Defining full-text indexing on SQL Server character-based columns.
- Creating a Web task that creates an HTML page, imports data from an HTML page, or runs an existing Web task.
The use of wizards dramatically reduces the learning curve required for a database administrator to become productive with Microsoft SQL Server. In the data warehousing environment, where database administrators are often supporting many steps of the process with multiple products, this translates into savings of both time and money.
Visual Database Diagrams
Because data warehouse applications are more iterative than OLTP systems, the database structures and schemas tend to change more often. Visual Database Diagrams provide physical data modeling tools for Microsoft SQL Server database administrators, simplifying the definition and change cycles. The diagrams are stored on the database server using SQL Server 7.0 Enterprise Manager. Changes to the database or to the diagram are reflected in each other. A wizard is available that automates the selection and layout of tables in an existing database. However, database entities (tables and their relationships) can be defined entirely within the Diagram tool.
Figure 10. The Database Diagram shows tables and their relationships, and allows changes to the structure of individual tables and the constraints linking tables together. When the diagram is saved, the changes are made to the database itself.
SQL Server Profiler
Proper tuning of a relational database requires knowledge of the way the database is used on a regular basis. SQL Server Profiler is a graphical tool that allows system administrators to monitor engine events in Microsoft SQL Server by capturing a continuous record of server activity in real-time. SQL Server Profiler monitors events that occur in SQL Server, filters events based on user-specified criteria, and directs the trace output to the screen, a file, or a table. SQL Server Profiler then allows the database administrator to replay previously captured traces to test changes to database structures, identify slow-performing queries, troubleshoot problems, or recreate past conditions.
Examples of engine events that can be monitored include:
- Login connects, fails, and disconnects.
- SELECT, INSERT, UPDATE, and DELETE statements.
- An error written to the SQL Server error log.
- A lock acquired or released on a database object.
Data from each event can be captured and saved to a file or SQL Server table for later analysis. Data from the engine events is collected by creating traces, which can contain information about the SQL statements and their results, the user and computer executing the statements, and the time the event started and ended
Event data can be filtered so that only a subset of the event data is collected. This allows the database administrator to collect only the event data he or she is interested in. For example, only the events that affect a specific database, or those for a particular user, can be collected; all others ignored. Similarly, data could be collected from only those queries that take longer than a given time to execute.
SQL Server Profiler provides a graphical user interface to a set of extended stored procedures, which you can use directly. Therefore, it is possible to create your own application that uses the SQL Server Profiler extended stored procedures to monitor SQL Server.
SQL Server Query Analyzer
SQL Server Query Analyzer is an excellent tool for the ad hoc, interactive execution of Transact-SQL (the SQL Server query language) statements and scripts. Because users must understand Transact-SQL in order to use SQL Server Query Analyzer, this facility is primarily meant for database administrators and power users. Users enter Transact-SQL statements in a full-text window, execute the statements, and view the results in a text window or tabular output. Users can also open a text file containing Transact-SQL statements, execute the statements, and view the results in the results window.
SQL Server Query Analyzer also offers excellent tools for determining how Microsoft SQL Server is interpreting and working with a Transact-SQL statement. A user can:
- Display a graphical representation of the execution plan generated for the statement.
- Launch the Index Tuning Wizard to determine what indexes can be defined for the underlying tables to optimize the performance of the statement.
- Display statistics about the performance of the statement.
Figure 11. The SQL Server 7.0 Query Analyzer shows graphically how complex queries are resolved. In this example, portions of a query are parallelized for performance improvement.
Index Tuning Wizard
One of the most time-consuming and inexact processes managing a relational database is the creation of indexes to optimize the performance of user queries. Microsoft Research has collaborated with the SQL Server development organization to develop a facility that simplifies this task. The Index Tuning Wizard is a new tool that allows a SQL Server database administrator to create and implement indexes without an expert understanding of the structure of the database, hardware platforms and components, or how end-user applications interact with the relational engine. The Index Tuning Wizard analyzes database workload and recommends an optimal index configuration for the SQL Server database.
The SQL Server Index Tuning Wizard has the ability to:
- Compare, contrast, and select the best mix of indexes using the Showplan cost formula.
- Recommend the best mix of indexes for a workload (trace file or SQL script) against a database.
- Provide index, workload, table use, and query cost analysis.
- Allow the system administrator to tune the database for a small set of problem queries without changing the index configuration.
- Allow the system administrator to prototype index configuration recommendations for different disk space constraints.
The Index Tuning Wizard can be used to analyze an SQL script or the output from a SQL Server Profiler trace and make recommendations regarding the effectiveness of the current indexes that are referenced in the trace or SQL script. The recommendations consist of SQL statements that can be executed to drop existing indexes and create new, more effective indexes. The recommendations suggested by the wizard can then be saved to an SQL script to be executed manually by the user at a later time, immediately implemented, or scheduled for another time by creating a SQL Server job that executes the SQL script.
If an existing SQL script or trace is not available for the Index Tuning Wizard to analyze, the wizard can create one immediately or schedule one using SQL Server Profiler. Once the database administrator has determined that the created trace has captured a representative sample of the normal workload of the database being monitored, the wizard can analyze the captured data and recommend an index configuration that will improve the performance of the database.
Data warehouse administrators can benefit significantly from automation of routine tasks, such as performing database backups. SQL Server administrative tasks can be automated by establishing which will occur regularly and can be administered programmatically, then defining a set of jobs and alerts for the SQL Server Agent service. Automated administration can include single or multiserver environments.
The key components of automated administration are jobs, operators and alerts. Jobs define an administrative task once so it can be executed one or more times and monitored for success or failure each time it executes. Jobs can be executed on one local server or on multiple remote servers; executed according to one or more schedules; executed by one or more alerts; and made up of one or more job steps. Job steps can be executable programs, Windows NT® commands, Transact-SQL statements, ActiveScript, or replication agents.
An operator is an individual responsible for the maintenance of one or more servers running SQL Server. In some enterprises, operator responsibilities are assigned to one individual. In larger enterprises with multiple servers, many individuals share operator responsibilities. Operators are notified though e-mail, pager or network messaging.
An alert is a definition that matches one or more SQL Server events and a response should those events occur. In general, an administrator cannot control the occurrence of events, but can control the response to those events with alerts. Alerts can be defined to respond to SQL Server events by notifying one or more operators, forwarding the event to another server, or raising an error condition that is visible to other software tools.
Through the combination of notifications and actions that can be automated through the SQL Server Agent service, administrators can construct a robust, self-managing environment for much of their day-to-day operational tasks. This frees administrators to tend to more important tasks that require their personal attention.
Through the technologies underlying the Microsoft Data Warehousing Framework and the significant advancements in Microsoft SQL Server 7.0, Microsoft is working to reduce the complexity, improve the integration, and lower the costs associated with data warehousing. Customers investing in data warehouse technology based on the Microsoft platform can be assured that they are creating applications with the best possible economic considerations, while maintaining full scalability and reliability of their systems.