This article may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist. To maintain the flow of the article, we've left these URLs in the text, but disabled the links.
SQL Server as a Business Development Platform
Using the Microsoft Repository
Leverage Repository Technology to Give Users the Best Business Information Possible
Editor's Note: "Using the Microsoft Repository"
and "The Open Information Model," page 42, by Patrick Cross and
Saeed Rahimi, and "Incremental Updates in OLAP Cubes," page 47, by
Claude Seidman, continue our series on data warehousing and
Incremental Updates.
In recent years, data warehousing best practices have included
the use of a repository to store information about the warehouse.
The repository information helps people assess the impact of
changes, aids in tracking down problems, and helps warehouse users
better understand the data they use to make key business decisions.
The Microsoft Repository is an important part of Microsoft's data
warehousing strategy. By following these examples, you'll understand
and use the repository more effectively.
What Is a Repository?
A repository
is a storage place for meta data, or data about data. Several
vendors have created repository products in the past 20 years,
starting with data dictionaries in the 1960s. More recent products
have evolved from these simple dictionaries into complex systems for
storing data from hundreds of applications and environments. Most
recently, Microsoft included its repository technology as part of
its data-warehousing framework (a collection of tools provided with
SQL Server), in recognition of the important role repositories play
in warehousing. Microsoft uses Extensible Markup Language (XML) to
integrate support for exchanging information in the repository. XML
provides a simple way for tools to interchange meta data. Interest
in repositories has recently surged, primarily because of the
emergence of data warehousing, knowledge management, and enterprise
application integration.
What Is Meta Data?
According to
Bill Inmon, meta data is "the description of the structure, content,
keys, indexes, etc. of data," (Managing the Data Warehouse,
John Wiley & Sons, 1996). More specifically, in a data-warehouse
environment, meta data can be information about warehouse data,
information about how to get a piece of data out of the warehouse,
or information about the quality of data in the warehouse. Meta data
can even give information about how to run warehouse tools to
perform different tasks. Meta data about a warehouse includes
information about systems, processes, source and target databases,
data transformations, data cleansing, data access, data marts, and
OLAP tools.
For example, a warehouse's meta data can tell you how an OLAP
tool calculated a report's Total Sales column. Without knowing
anything else about the warehouse, a user looking at a regional
sales report might think that Total Sales includes all discounts and
shipping and handling charges. But Total Sales might not include any
or all of these elements. To help the end user, the warehouse needs
meta data that ties the reports' columns to data transformations,
data queries, field calculations, and source database tables and
columns. The warehouse can provide this help if it has meta data
about the source databases (such as tables and columns), what
information was extracted, how the information was transformed, the
target (or warehouse) database, data warehouse reporting tools, and
database design and modeling tools.
Many warehouse implementations today use a simple spreadsheet to
capture the source and target mappings and conversions. But beyond
the project's initial requirements phase, this information quickly
becomes outdated and inaccessible to most users who interact with
the warehouse.
Why Is Meta Data Important?
Meta
data acts as a road map to the information in the warehouse. It
helps users find out what type of data is in the warehouse, what a
piece of data means, how to access the data, who owns the data, and
who created it and when and how. Without meta data, data warehouse
users can access data but not information in context that helps them
make business decisions with confidence. Currently, the best
practices for warehouse and data-mart development mandate having a
meta data strategy that makes the warehouse easy to update and use.
Meta data can provide complementary information about the
warehouse's contents. For example, an analyst can use meta data to
better understand what information is available and how it is
calculated. Meta data provides a detailed analysis of where the
information came from and can give a confidence factor for
describing the data's validity.
Meta data can also maintain a history of activity before the
current processes. This history is important in understanding how
the structure and processes change over time. If a user wants to
compare sales by product category during the past 2 years, but the
product category was added this year, the results of the comparison
are meaningless.
Also, meta data provides a common frame of reference, acting as
the organization's central knowledge base. For example, everyone who
looks at the regional annual sales report can arrive at the same
understanding of what Total Sales means.
The ability to analyze the impact of changes in the warehouse can
drastically reduce the cost of ongoing maintenance. If you need to
change a source table, you can simply look at the table's other
columns to determine which parts of the warehouse use those columns.
With this usage information, you can make a quick list of items that
you'll need to modify to accommodate the source system changes.
Likewise, you can more easily fulfill requests for additional
information in the warehouse if you can see where information is
already coming from. Before adding new data to a table, review all
the data sources for the table.
Meta Data Services
Repository
technology is the core of the services that SQL Server provides for
managing meta data. These services' features bring significant value
to information consumers and people who maintain the repository
data. The most important features are information models, the
Microsoft Repository engine, XML interchange, and extensibility.
Information models. Information models define the
meta data in terms of object types and their relationships. The
information model is thus the language for describing the
information the repository will store. Microsoft's data-warehousing
consortium (for more information, see "The Open Information Model,"
page 42) chose the Unified Modeling Language (UML) for documenting
and communicating the information model the Microsoft Repository
uses. Although users can define their own information models to
describe meta data, Microsoft created the Open Information Model
(OIM) as a common specification for storing information about
systems, warehouses, etc.
Repository engine. The Microsoft Repository engine
handles meta data storage and retrieval. The engine uses the
information model and stores meta data instances that the
information model describes in SQL Server tables and columns. And
the engine uses caching to optimize access to the meta data.
Versioning is an important feature of the Microsoft
Repository, and is fairly new to repositories generally. Versioning
means that as updates happen to objects in the repository, those
changes are captured, and the repository maintains a history of how
the repository data changed. By letting you view objects as they're
updated, the repository can answer questions about changes over
time. For example, if new columns are added to the warehouse, you
can track and query this addition later. This ability helps you
solve problems related to warehouse alterations that cause sudden
changes to data the users receive.
XML interchange. Besides describing the data types
that the repository can store, the information model includes an XML
format for interchanging meta data between the repository and other
tools. And users can easily import information into the repository
from custom sources or other tools. One example in "Sample
Repository Applications" (which you can download at the link to this
article at http://www.sqlmag.com) contains a sample XML file that
you can use to load data into the repository.
Extensibility. Extensibility is an important
repository feature that lets users add information to the models to
cover topics specific to their organization or tool. Using the
Microsoft Repository software development kit (SDK), which is
available at the Repository Web site (for this and other resources,
see "For More Information"), you can extend the information models
that the Repository uses. If your organization needs to store
information about locations that aren't part of the base OIM, you
can extend the model to add a new object and corresponding
relationships to track that information. Sites can also add new
properties to track information for existing objects (e.g., adding a
new attribute to a table).
Integration with the Microsoft
Data-Warehousing Framework
The Repository is a component
of Microsoft's data-warehousing framework and an important part of
its SQL Server strategy. Microsoft adds value for its database
customers by bundling warehouse tools with SQL Server 7.0, such as
Data Transformation Services (DTS), OLAP services, English Query,
and the Repository. You can also access repository data through the
SQL Server Enterprise Manager, as Screen
1 shows, by clicking the Metadata node under Data Transformation
Services. From there, you can access the relational schema
information and package and lineage information. You can explore the
relationships between columns and the packages that update them.
Data Transformation Services. DTS is the
extraction, transformation, and loading (ETL) tool that comes with
SQL Server 7.0. Its flexibility and range of features help you
populate a warehouse. You can save packages directly into the
Repository. One of DTS's more compelling features, lineage, requires
the Repository. Lineage lets you track how data in the warehouse was
calculated and when it entered the warehouse.
To save DTS information into the Microsoft Repository, choose SQL
Server Repository as the location for saving the package. Use the
Advanced tab on the Package Properties to set the scanning options,
which Screen
2 shows. Doing so causes DTS to call the OLE DB scanner to load
all source and target catalogs into the Repository. If you don't set
the scanning options, DTS creates DTS Local Catalogs as the
reference for all source and target catalogs, which can make
locating the databases impossible. Each subsequent save replicates
this reference, so you can't keep comments and other descriptive
information updated.
You can run into problems when you try to save certain DTS
transformations to a repository. If you use a script to perform a
simple transformation and you choose the source columns explicitly
(not from a query), all the transformation data is captured, as you
can see in the transformation model in "The Open Information Model."
If you choose a query as the transformation source, that source
becomes objects that aren't part of the OLE DB imported data. This
choice makes following the connection back to the true source
objects difficult. Also, the query isn't parsed to create a
connection between the query columns and the columns you select the
data from. So in many cases, the connection between source and
target is available, but in some, it isn't. You can solve these
problems by writing a program to resolve the references in a
repository or by using a custom model along with the DTS model to
store the source target mappings.
DTS also uses versioning for the package object, but it
replicates all the subordinate objects with each save. You can then
go back to any version of a package to see exactly how the data was
transformed. This capability is important for tracking down problems
with data months (or years) after it entered the warehouse. This
versioning scheme won't cause many problems if you use the
appropriate scanning options to import the relational schemas, which
will then be appropriately versioned for continuity.
OLAP Services. OLAP Services gives you
multi-dimensional analysis for warehouse data. By using a utility in
the Repository SDK, you can import all the definitions for the OLAP
data into a repository. With SQL Server 2000, OLAP Services stores
meta data directly in the repository. One problem with the utility
in SQL Server 7.0 is that it doesn't connect the measures and
dimensions to the underlying columns they're based on. Making these
connections can be a tedious manual task, or you can write a program
to make the connections automatically. The information about which
columns are used is available in the repository model, so you can
easily retrieve the data.
English Query. English Query lets you define a
semantic model for a database, then translates English phrases into
SQL. The semantic model provides information about a system and how
it is used, and can be valuable as you use the warehouse. The
Repository SDK contains a utility to import data from English Query
into a repository. SQL Server 2000 has an option within English
Query to export and import models from a repository.
OLE DB (relational schema). Importing relational
schemas into a repository gives you the base set of information
necessary to begin documenting the warehouse. You can import
database schemas from any OLE DB- or ODBC-compliant data source. To
run the import from Enterprise Manager, right-click the Metadata tab
under Data Transformation Services and choose Import Metadata. This
import utility uses full versioning as it loads the data, so you can
easily understand the changes to the database schemas over time.
Because the utility uses versioning, it preserves all the
descriptive information, such as comments, and rescanning the
catalog doesn't affect the information.
Example Usage Scenarios
The
following usage scenarios demonstrate how end users and warehouse
analysts can benefit from the data in the Microsoft Repository. For
most organizations, the end users benefit most from the assurance
that they have the appropriate information and an understanding of
the data they're viewing.
Following a data path from OLAP cube to source
system. A business user is using an OLAP client to view data
and needs to better understand how the data was calculated and where
it came from. The user is familiar with the company's current
operational systems, but isn't familiar with the warehouse. The
organization has undergone numerous mergers and acquisitions, so
several systems that populate the warehouse originated in different
companies.
The user starts with the OLAP client that is displaying sales
data from a cube. To get more information about what the data means
and where it came from, the user could obtain from the Repository
descriptive information about the cube and its measures and
dimensions. This description would be more informative than the
simple labels assigned to each item. For Total Sales, the
description might note that returns don't appear until the
end-of-quarter processing, or that the regions changed in 1997, so
region comparisons before that year will return invalid results. The
user can then follow the path from a particular measure or dimension
to the warehouse, then through the transformations to the source
systems. If users are interested in data from a specific source
system, they can use the Repository to determine which cubes or
reports display data from that system. Descriptions of the source
systems, the transformations that moved the data, and related
information are also available. This information gives users a frame
of reference in terms of the current business, not just the
warehouse.
Using DTS lineage to find source data and
transformations. In another case, a user is viewing data
from a report or cube and wants to know exactly where the data came
from. This environment has seven source systems, each of which uses
a different package to populate the warehouse. Using the lineage
feature of DTS, the user can determine which package a system used
to populate the warehouse with a given row. From a particular report
row, the user can determine the lineage IDs for that warehouse data,
then query the Repository to find out which packages from which
source systems populated that data. This process gives the user a
detailed analysis of the data source. An experienced warehouse
analyst can also use this information to see the exact
transformation package that populated that data, even if the package
changed after populating the data.
Make the Best Decision
The
Microsoft Repository and its integration with other warehousing
tools give you the kind of information that's impossible to obtain
in a typical warehousing environment, which stores designs and
transformations primarily in spreadsheets. The goal of warehousing
is to give as much information to as many users as possible, so that
they can make business decisions based on the best information.
But without a means to understand and describe that data, users
might get too much information or not trust the data and could base
decisions on a faulty interpretation of the data's meaning.
Repository technology can play a significant role in the delivery of
warehouse information to end users, supplying consistent
descriptions and a road map of the data being viewed. Microsoft and
other vendors provide significant integration with data warehousing
tools to make meta data easier to use.
Bugs, comments, suggestions | Legal | Privacy | Advertising
Copyright © 2002 Penton Media, Inc. All rights reserved.