Just What Are Cubes Anyway? (A Painless Introduction to OLAP Technology)
Carl Dubler and Colin Wilcox
Microsoft® Excel 2002
Summary: This quick and easy introduction to OLAP databases shows you the differences between traditional Online Transaction Processing (OLTP) databases and Online Analytical Processing (OLAP) databases and how to access OLAP databases from your Office solutions. (8 printed pages)
Chances are your business uses at least one database, and probably more. The databases store information about business transactions, plus other data such as employee records. Those types of systems are called online transaction processing (OLTP) databases.
You may not know it, but your OLTP data contains a wealth of information that can help you make informed decisions about your business. For example, you can calculate your net profits for last quarter and compare them with the same quarter of the previous year. It can also provide other types of valuable information such as which employees are the most and least productive, and the optimum levels of goods to keep in stock.
The process of analyzing your data for that type of information, and the data that results, are collectively called business intelligence. Typically, business intelligence tries to answer the following types of questions:
- What were the total sales of all products last year?
- How does our profitability for the first quarter of this year compare to the same time period during the past five years?
- How much money did customers over age 35 spend last year, and how has that behavior changed over time?
However, you can spend a lot of time and money trying to extract business intelligence information from your database. Some organizations use a small army of data professionals and perhaps a dozen different software packages to produce simple reports. Also, if the report doesn't have the proper information, its creators have to start over.
The time and expense involved in retrieving answers from databases means that a lot of business intelligence information often goes unused. The reason: most operational databases are designed to store your data, not to help you analyze it. The solution: an online analytical processing (OLAP) database, a specialized database designed to help you extract business intelligence information from your data.
The following sections provide an overview of both types of databases. From there, we will look at OLAP databases in more detail.
Most businesses use OLTP databases to gather and store the records generated by their daily operations. Typically, OLTP databases execute transactions, meaning that they add, update, or delete groups of records at the same time. For example, the database for a grocery store inserts and updates information about prices, purchases, and costs of goods and freight, and it usually does so at lightning speed. After all, you don't want your customers to wait in line while your inventory system updates its stock and pricing tables.
However, the design that allows OLTP databases to record transactions quickly and accurately also makes it hard to analyze their data for several reasons. First, OLTP databases contain a large number of tables, sometimes hundreds. Those tables often have multiple relationships with other tables in the database. That complexity can make it hard to understand the database and know where to look for data.
The following figure depicts some of the tables and relations that exist in the Northwind sample database provided by Microsoft® SQL Server™ 2000:
Figure 1. Part of the Northwind database schema
Second, if you try to extract analytical data from an OLTP database, you usually need to create and run stored procedures, which are groups of SQL statements compiled into a single execution plan. Stored procedures can take hours to run, and they can slow down the production database, something you don't want to do with a live system. (Remember the whole "customers waiting in line" thing? You don't want that to happen.)
Third, during normal operations, OLTP databases constantly update their data. Trying to analyze changing data is, well, like trying to analyze changing data. You will always have a hard time obtaining an accurate result, assuming you can obtain one at all.
Finally, OLTP databases usually store individual records. For example:
On April 2, John Smith bought a case of apples from Jane Doe for $5.00.
That type of storage poses a problem for analysts because they use summarized data—totals and subtotals—to help answer business intelligence questions. Individual records don't help them at all.
In other words, you need a system that extracts data from your OLTP database, aggregates it into totals and subtotals, and then displays the resulting data in a way that allows you to spot past successes and failures, and to identify potential future successes and failures.
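As a rough illustration of the gap between individual records and the summaries analysts need, the following Python sketch totals a handful of transaction records by product. The records, the year, and the field layout are invented for the example; only the first row echoes the apple purchase mentioned above.

```python
from collections import defaultdict

# Hypothetical transaction records of the kind an OLTP system stores:
# (date, customer, product, seller, amount). All values are made up.
records = [
    ("2000-04-02", "John Smith", "apples", "Jane Doe", 5.00),
    ("2000-04-02", "Ann Lee", "apples", "Jane Doe", 10.00),
    ("2000-04-03", "John Smith", "pears", "Jane Doe", 7.50),
]

# Summarize the individual records into per-product subtotals.
totals_by_product = defaultdict(float)
for _date, _customer, product, _seller, amount in records:
    totals_by_product[product] += amount

# The grand total is the sum of the subtotals.
grand_total = sum(totals_by_product.values())

print(dict(totals_by_product))
print(grand_total)
```

With three rows the loop is trivial. The point of an analysis database is to have such totals ready without scanning millions of live transaction rows each time.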
The solution to that problem is called an Online Analytical Processing (OLAP) database.
OLAP and OLTP databases differ in several respects. First, IT departments usually keep OLAP databases isolated from OLTP databases. Doing so ensures that the transaction database performs well, and that the OLAP database only receives historical business data. In addition, while the data in an OLTP database constantly changes, the data in an OLAP system never changes. Users never perform data-entry or editing tasks on OLAP data. All they can do is run mathematical operations against the data.
Second, OLAP databases use fewer tables and a different type of schema. For example, an OLAP database typically uses between five and 20 tables. In addition, OLAP databases usually keep the number of joins to a minimum by arranging tables in a star schema. The following figure depicts how part of the Northwind database could look when converted to a star schema:
Figure 2. A star schema
The central table in the schema is the fact table. Fact tables contain numeric data, such as zip codes, and additive data such as the total costs of freight for all beverages.
By themselves, numeric facts do not have much meaning. For instance, the number 206 by itself does not mean much. However, it takes on more meaning if you know that it represents an area code or the number of dishwashers sold yesterday. In a star schema, dimension tables contain the descriptive text that gives meaning to the numbers. Keep in mind that most analyses involve time, which makes time itself a key dimension.
The facts in a dimension are called members. By design, OLAP databases group the related facts in a member into hierarchies whenever the underlying data supports that type of structure. For example, the Time dimension in the preceding figure contains the following hierarchy:
- Order Date
Hierarchies use traditional parent/child relationships. For instance, Quarter is a child of Year, Month is a child of Quarter, and so on. If a child contains data that your OLAP system can aggregate, its parent level contains those aggregated sums. Some systems call those aggregated sums rollups. Whenever you drill up or down through your data, you navigate through those hierarchies.
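The parent/child rollup described above can be sketched in a few lines of Python. The sales figures and the (year, quarter, month) keys are hypothetical; a real OLAP system stores and maintains these levels for you.

```python
from collections import defaultdict

# Hypothetical monthly sales keyed by (year, quarter, month).
monthly_sales = {
    (1999, "Q1", "Jan"): 100,
    (1999, "Q1", "Feb"): 120,
    (1999, "Q1", "Mar"): 80,
    (1999, "Q2", "Apr"): 90,
    (2000, "Q1", "Jan"): 150,
}

quarterly = defaultdict(int)
yearly = defaultdict(int)
for (year, quarter, _month), amount in monthly_sales.items():
    quarterly[(year, quarter)] += amount  # months roll up into their quarter
    yearly[year] += amount                # and into their year

print(dict(quarterly))
print(dict(yearly))
```

Drilling down from 1999 to its quarters means moving from the `yearly` entry to the matching `quarterly` entries; drilling up is the reverse.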
The joins between the dimension and fact tables allow you to browse through the facts across any number of dimensions, as well as up and down any number of hierarchies. For example, you might query for:
- Total sales and total costs for all beverages purchased in 1999 by customers in Colorado.
- Total sales and total costs for beer purchased in 2000 by customers in Colorado.
The second query, of course, takes data from different hierarchy levels in the Time and Product dimensions.
The simple design of the star schema makes it easier to write queries, and they run faster. For example, running the total sales and costs query against an OLTP database could involve dozens of tables, making query design complicated. In addition, the resulting query could take hours to run.
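To make the star schema more concrete, here is a minimal sketch that uses Python's built-in sqlite3 module in place of SQL Server. The table names, column names, and rows are all assumptions made for the example and do not match the Northwind schema. It builds one fact table with three dimension tables and then runs the two sample queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold the descriptive text.
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, category TEXT, name TEXT);
CREATE TABLE customer_dim (customer_id INTEGER PRIMARY KEY, state TEXT, city TEXT);
-- The fact table holds the numbers plus one key per dimension.
CREATE TABLE sales_fact (
    time_id INTEGER REFERENCES time_dim(time_id),
    product_id INTEGER REFERENCES product_dim(product_id),
    customer_id INTEGER REFERENCES customer_dim(customer_id),
    sales REAL, cost REAL);
INSERT INTO time_dim VALUES (1, 1999, 'Q1'), (2, 2000, 'Q1');
INSERT INTO product_dim VALUES
    (1, 'Beverages', 'Beer'), (2, 'Beverages', 'Tea'), (3, 'Produce', 'Apples');
INSERT INTO customer_dim VALUES (1, 'Colorado', 'Denver'), (2, 'Washington', 'Seattle');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 20.0, 12.0),
    (1, 2, 1, 10.0, 4.0),
    (1, 3, 1, 5.0, 3.0),
    (1, 1, 2, 30.0, 18.0),
    (2, 1, 1, 25.0, 15.0);
""")

# Every query joins the fact table to the dimensions it filters on.
BASE = """
SELECT SUM(f.sales), SUM(f.cost)
FROM sales_fact AS f
JOIN time_dim AS t ON t.time_id = f.time_id
JOIN product_dim AS p ON p.product_id = f.product_id
JOIN customer_dim AS c ON c.customer_id = f.customer_id
WHERE t.year = ? AND c.state = ? AND """

# All beverages, 1999, Colorado (category level of the product hierarchy).
beverages_1999 = conn.execute(
    BASE + "p.category = ?", (1999, "Colorado", "Beverages")).fetchone()
# Beer only, 2000, Colorado (one level further down the product hierarchy).
beer_2000 = conn.execute(
    BASE + "p.name = ?", (2000, "Colorado", "Beer")).fetchone()

print(beverages_1999, beer_2000)
```

Both queries share the same three joins; only the filter on the product dimension changes, which is what keeps star-schema queries easy to write.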
Third, OLAP databases make heavy use of indexes because they help find records in less time. In contrast, OLTP databases avoid them because they lengthen the process of inserting data.
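The effect of an index on a lookup can be observed with a small experiment. This sketch again uses Python's sqlite3 module as a stand-in (SQL Server's planner and syntax differ), with a hypothetical sales table and index name. It asks SQLite how it plans to run the same query before and after the index exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (state TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Colorado", 5.0), ("Washington", 7.0), ("Colorado", 3.0)],
)

QUERY = "SELECT SUM(amount) FROM sales WHERE state = 'Colorado'"

def plan(sql):
    """Return SQLite's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " ".join(row[-1] for row in rows)

plan_before = plan(QUERY)  # without an index, the whole table is scanned
conn.execute("CREATE INDEX idx_sales_state ON sales (state)")
plan_after = plan(QUERY)   # with the index, matching rows are looked up directly
total = conn.execute(QUERY).fetchone()[0]

print(plan_before)
print(plan_after)
print(total)
```

The trade-off mentioned above also applies here: every insert into `sales` now has to update `idx_sales_state` as well, which is the cost a busy transaction system tries to avoid.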
Now that you have an OLAP server and a star schema, you're ready to go, right? Well, no. Remember, IT departments deliberately isolate OLAP and OLTP databases. You need a way to move the data to the OLAP database, combine that data into useful aggregations, and then populate the tables. That process is often called Extract, Transform, and Load (ETL). SQL Server has a built-in utility called Data Transformation Services (DTS) that performs the ETL tasks.
You typically use DTS to populate your OLAP schemas, and then automatically update your data. The update interval depends on your business, and the types of answers you want from your data.
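The three ETL stages can be pictured with the following Python sketch. It is not DTS and does not use any DTS API. The order rows, the `sales_fact` layout, and the in-memory SQLite target are assumptions chosen only to show the extract, transform, and load steps in sequence.

```python
import sqlite3
from collections import defaultdict

# Extract: hypothetical order rows as they might come out of an OLTP system.
oltp_orders = [
    {"order_date": "1999-01-15", "product": "Beer", "amount": 20.0},
    {"order_date": "1999-01-15", "product": "Beer", "amount": 5.0},
    {"order_date": "1999-02-03", "product": "Tea", "amount": 10.0},
]

# Transform: summarize to one row per (year, month, product).
summary = defaultdict(float)
for row in oltp_orders:
    year, month, _day = row["order_date"].split("-")
    summary[(int(year), int(month), row["product"])] += row["amount"]

# Load: write the summarized rows into a separate analysis database.
olap = sqlite3.connect(":memory:")
olap.execute(
    "CREATE TABLE sales_fact (year INTEGER, month INTEGER, product TEXT, sales REAL)"
)
olap.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    [(y, m, p, total) for (y, m, p), total in sorted(summary.items())],
)
olap.commit()

loaded = olap.execute(
    "SELECT year, month, product, sales FROM sales_fact ORDER BY year, month, product"
).fetchall()
print(loaded)
```

In practice a job like this would run on a schedule against the live database, at whatever update interval suits the business.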
OK, now you're ready to produce killer reports, right? Sorry! Even though they use a simplified data structure, star schemas are sometimes too complicated for some analysts to understand. In addition, OLAP databases can contain the same type of information found in your OLTP databases:
On April 2, John Smith bought a case of apples from Jane Doe for $5.00.
In other words, you still don't have the aggregated data you need to answer your questions. Can you ever win?
Data cubes provide the final piece of the puzzle. A cube aggregates the facts in each level of each dimension in a given OLAP schema. The business intelligence industry uses the word "cube" because it best describes the resulting data. For example, let's consider our star schema. When you create a cube from that schema, you take the freight, quantity, discount, and other facts and add them up by city, by year, by city and year, and by every other possible combination of dimension and hierarchy level. Those calculations produce the following type of data structure:
Figure 3. In other words, a cube
Note: Data cubes are not "cubes" in the strictly mathematical sense because they do not have equal sides. However, virtually all analysts use the term, and it is an industry standard.
Here is where things get really exciting. Because the cube contains all of your data in an aggregated form, it seems to know the answers in advance. For example, if a user asks for total sales by year and city, those numbers are already available. If the user asks for total sales by quarter, category, zip code, and employee, those numbers and names are already available. If it helps you to understand them, think of cubes as specialized small databases that know the answers before you even ask the questions.
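The idea of knowing the answers in advance can be sketched as a table of precomputed sums, one for every combination of dimensions. This is a toy model under stated assumptions: three flat dimensions, one measure, invented rows, and hypothetical `build_cube` and `lookup` helpers. Real cube engines are far more selective about what they precompute and store.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("year", "city", "category")

# Hypothetical fact rows: dimension values plus one measure (sales).
facts = [
    {"year": 1999, "city": "Denver", "category": "Beverages", "sales": 20.0},
    {"year": 1999, "city": "Seattle", "category": "Beverages", "sales": 30.0},
    {"year": 2000, "city": "Denver", "category": "Produce", "sales": 5.0},
]

def build_cube(rows):
    """Precompute the sum of sales for every subset of the dimensions."""
    cube = defaultdict(float)
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            for row in rows:
                key = (dims, tuple(row[d] for d in dims))
                cube[key] += row["sales"]
    return dict(cube)

def lookup(cube, **filters):
    """Answer a question by reading a precomputed cell, with no scan of the facts."""
    dims = tuple(d for d in DIMENSIONS if d in filters)
    return cube.get((dims, tuple(filters[d] for d in dims)), 0.0)

cube = build_cube(facts)

print(lookup(cube))                            # grand total
print(lookup(cube, year=1999))                 # total for one year
print(lookup(cube, year=1999, city="Denver"))  # total for one year and city
```

All the work happens once, in `build_cube`. Each later question is a single dictionary read, which is why cube queries tend to return so quickly.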
That is the big advantage of a cube. You can ask any pertinent question and get an answer, usually at warp speed. For instance, the largest cube in the world is currently 1.4 terabytes and its average response time to any query is 1.2 seconds! In addition, you can view cube data with any valid tool, including spreadsheets, Web pages, the Cube Browser in Analysis Services 2000, or graphic data browsers such as Microsoft Data Analyzer.
You can use a variety of tools to build cubes, including Microsoft Excel, Analysis Services 2000 (which comes with Microsoft SQL Server 2000), or OLAP Services 7.0, the predecessor of Analysis Services.
The steps in this section explain how to use Excel to create a connection to a data source and build a cube. The steps create a local cube using data from the FoodMart 2000 Microsoft Access database (.mdb), which comes with SQL Server 2000 with Analysis Services.
The process of creating a cube takes place in several discrete phases:
- Choose a data source
- Create the query that extracts data from the database
- Create the cube from the extracted data
To select a data source
- Open Excel 2002.
- On the Data menu, point to Import External Data, and then click New Database Query.
- In the Choose Data Source dialog box, click the Databases tab, select New Data Source and then click OK.
- In the Create New Data Source dialog box, enter a name for the data source in the first text box, select a driver for the data source from the list (for this example, the Microsoft Access driver), and then click Connect.
- In the ODBC Microsoft Access Setup dialog box, click Select.
- In the Select Database dialog box, navigate to the foodmart 2000.mdb file, and then click OK.
Note: By default, SQL Server places the foodmart 2000.mdb file at C:\Program Files\Microsoft Analysis Services\Samples\foodmart 2000.mdb.
- Click OK twice more to return to the Choose Data Source dialog box.
To create the query
- In the Choose Data Source dialog box, select the data source you created in the previous set of steps. Make sure Use the Query Wizard to create/edit queries is selected and then click OK.
- In the Query Wizard—Choose Columns dialog box, select the columns of data you want in your cube. Typically, you include the data from at least one fact table to provide measures for your cube, and data from one or more dimension tables, including a time dimension. In the Available tables and columns list, click the plus sign (+) to expand the table, select a column, and then click the 'greater than' angle bracket (>) to move the column into the Columns pane in your query list box. For this example, select the following tables and columns:
Table            Columns
sales_fact_1998  store_sales, store_cost, unit_sales
time_by_day      the_date
Product          brand_name
Product_class    product_category, product_subcategory
customer         country, state_province, city, lname
Store            store_country, store_state, store_city, store_name
- Click Next and then click Next in the next two screens.
- In the Query Wizard—Finish screen, select Create an OLAP Cube from this query and click Finish. This launches the OLAP Cube Wizard, and you use the wizard to build your cube.
To create the cube
- Click Next in the Welcome to the OLAP Cube Wizard screen. In step 1 of the wizard, select store_sales, store_cost, and unit_sales in the Source field column. In the Summarize by column, select Sum for each field, and then click Next.
- In step 2 of the wizard, drag the_date from the Source fields box to the Dimensions box. Right-click the_date, click Rename, and enter Time as the name of the dimension. Clear the Week and Day check boxes under the dimension name.
- Drag product_category, product_subcategory, and brand_name so that they appear in that order, in the next available dimension. Rename the dimension to Product.
- Drag country, state_province, city, and lname so that they appear in that order, in the next available dimension. Rename the dimension to Customer.
- Drag store_country, store_state, store_city, and store_name so that they appear in that order in the next dimension. Rename the dimension to Store. Click Next.
- Select the option that best fits the type of cube you want to create. For our example, select Save a cube file containing all data for the cube. Enter a path and filename for the cube, and then click Finish.
- In the Save As dialog box, enter a filename for the query definition that you just created and click Save. Saving the query definition allows you to reuse it later. The file is saved with an .iqy filename extension.
- The Cube Wizard creates the cube file. This may take several minutes. Once the Cube Wizard creates the cube, the PivotTable and PivotChart Wizard—Step 3 of 3 screen appears. Use the screen to create a Microsoft PivotTable® report from the data in the cube you just created. Use the Options and Layout buttons to configure the report, or click Cancel to exit the wizard.
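For readers who want a feel for what the wizard's choices amount to, the following Python sketch mimics two of them: summing the three measures, and splitting the_date into Year, Quarter, and Month levels for the Time dimension. It does not read the FoodMart database or produce a cube file. The rows are invented, and `time_levels` is a hypothetical helper.

```python
from collections import defaultdict
from datetime import date

# Invented rows that reuse the column names selected in the query above.
rows = [
    {"the_date": date(1998, 1, 5), "store_sales": 10.0, "store_cost": 4.0, "unit_sales": 3},
    {"the_date": date(1998, 2, 9), "store_sales": 6.0, "store_cost": 2.5, "unit_sales": 2},
    {"the_date": date(1998, 4, 1), "store_sales": 8.0, "store_cost": 3.0, "unit_sales": 4},
]
MEASURES = ("store_sales", "store_cost", "unit_sales")

def time_levels(d):
    """Split a date into the Year, Quarter, and Month levels of a Time dimension."""
    return (d.year, f"Q{(d.month - 1) // 3 + 1}", d.month)

# Summarize each measure by Sum at the quarter level.
by_quarter = defaultdict(lambda: dict.fromkeys(MEASURES, 0))
for row in rows:
    year, quarter, _month = time_levels(row["the_date"])
    for measure in MEASURES:
        by_quarter[(year, quarter)][measure] += row[measure]

for key in sorted(by_quarter):
    print(key, by_quarter[key])
```

The Week and Day levels are left out here, matching the check boxes cleared in step 2 of the wizard.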
Finally, stay tuned. We will have a lot more to say about business intelligence and OLAP in future articles.
Carl Dubler is a Technical Specialist in the Microsoft Denver, Colorado office. He specializes in helping customers understand Microsoft database and analysis technologies.
Colin Wilcox is a Technical Writer for the Microsoft Office team. He specializes in helping end users understand Data Analyzer and related computer technologies.