Skip to main content

Thinking Global BI: Data-Warehouse Principles for Supporting

Enterprise-Enabled Business-Intelligence Applications

Charles Fichter
Microsoft Corp.

December 2009

 

Summary: This article focuses on the design principles to support a global data-warehouse (DW) architecture, the golden triumph of any successful business-intelligence (BI) application story. It draws from the Microsoft Global independent software-vendor (ISV) partner experience in designing enterprise BI applications by using Microsoft platform technologies and contains external links and references to public content that delves deeper into the design topics that are covered.

This article assumes that the reader has some basic DW understanding of a dimensional store, the underlying fact tables in which columns are known as measures, dimension tables in which columns are known as attributes, and how schemas take on star and snowflake patterns. There are many available resources to provide this overview; however, if needed, a concise overview can be found here: http://www.simple-talk.com/sql/learn-sql-server/sql-server-data-warehouse-cribsheet. This article focuses also on successful DW project strategies and advanced topics of effective design for performance.

 

Introduction

Architects who are looking to solve enterprise BI solutions are often enticed by packaged software applications that are able to fulfill executive requests for effective reports that reveal deep analysis of business performance. Microsoft and its vast array of ISV partners have made significant inroads into fulfilling a vision for easing the burden of generating the BI dashboards that all executives dream of—providing them with up-to-the-minute results of their business strategies and the ability to drill down into specific areas. Too often left unsaid, however, is the larger 90 percent effort that is left to the supporting architect and IT force behind the glamorous UI: how to get the data; scrub and aggregate it effectively; and design appropriate, manageable, and performant dimensional stores, including ad-hoc query support, remote geography replication, and even data marts for mobile decision-maker support with ever- increasing volumes of dimensional data.

 

Understanding Isolated Enterprise Data, and Accessing It

Enterprise architects who are looking to aggregate application data stores into meaningful Multidimensional Online Analytical Processing (MOLAP) dimensional models are often faced with many internal obstacles to accessing source data. These obstacles are often less technical and more business-, legal-, audit-, or security- sensitive; or overhead is too restrictive, project process, or even political, as business data can represent “turf” among executives and divisions. Some of the obstacles are technology constraints such as noncompatible or proprietary solutions, legacy file formats, and nonrelational or unstructured data. But as vendor tools (especially enhancements in Microsoft SQL Server 2008, particularly with Microsoft SQL Server Integration Services [SSIS] capabilities) and service oriented–architecture (SOA) technologies advance (for example, adoption of WS* and other open connectivity standards), this is becoming far less of an issue.

However, many BI projects are stalled and/or eventually killed because of a failure by the team to understand accurately what data was required, and how to access it successfully and make it usable. Usability is a key concept. How do you take a dozen columns (with names such as “xtssalescongproc”) and consolidate them in a central fact table that has readable column names, so that end users can leverage self-service BI technologies in the future?

The following are a few general tips to help avoid the pitfalls of navigating access to isolated data:

  1. Establish strong executive sponsorship early. The success of your project will be determined by how deeply and broadly across enterprise stores you have executive mandate. Asking for access is merely 1 percent of the effort. You might be incurring significant time and costs across divisions—often, potentially affecting their service to customers to grant the analysis, access, and/or aggregation that you might be asking of them. In addition, does that division truly understand their own data? How much time are you asking them to analyze and assess even what data and capacity they have to provide for you? Do not underestimate the importance of executive sponsorship or the potential cost of time and resources that you might be asking across other high-value data stores and the people who manage them.
  2. Does anyone truly know the data? This might seem like an obvious question; but, as we have done more of these efforts, it never ceases to surprise how little enterprise customers often know about their own data. Many ambitious BI projects are halted quickly, with a realization that first a project team must perform a full analysis of all enterprise data stores, which can often take months. Simply looking at tables can befuddle architects who are new to the data source, as column names do not inherently describe the data that they contain. Often, applications and stores have merely been maintained and not enhanced over the years, and the intent and design of data stores is tribal knowledge that was lost long ago. Can you effectively look at a 500-plus-table database and understand every relationship without understanding the minutiae of every application that utilizes the store? Using advanced vendor tools and ample time, perhaps. The devil is in the details, and the strength of your dimensions and attributes later depends on your understanding of the raw data sources that are at the base of your aggregation.
  3. Understand local priorities, and consolidate.The highest BI reporting demands are often local/regional in nature (by country/trading domain), which begs the question: Do you truly need a giant, aggregated DW store immediately? Or can you more effectively build a distributed warehouse and BI strategy, focusing on local/regional needs along with corporate? This is explored in more depth in the next section, as there might be significant cost savings, less network saturation, and performance benefits to a decentralized approach.

 

Consider Approaching Data Warehousing in Phases

Many BI and DW projects begin with ambitions to design and build the world’s greatest aggregated, dimensional store ever, with the intention of replicating subsets of the dimensional stores to geographies (the notion of data marts). However, is this approach always necessary or even feasible? Nearly every DW and BI project underestimates the investment of time and resources that it takes to design aggregation and scrub, build, and/or replicate data to independent MOLAP stores to support global enterprise needs.

The maturity and skill of each enterprise to deliver BI solutions can vary greatly, and breaking up the effort into smaller, agile-based deliveries can help your teams gain the experience and expertise that are needed to understand how to deliver against the larger, longer- term objectives. Remember the 80/20 rule: Most of the time, you can deliver 80 percent of the required features from 20 percent of the effort. Consider a simplified approach in phases, as shown in Figure 1.


Figure 1. Proposed global BI/DW phases (Click on the picture for a larger image)

 

Phase 1

If BI reporting needs are infrequent enough, leave as much source data in place, and build multidimensional views through a distributed caching strategy to deliver a subset of business solutions quickly. Can you merely query efficiently by using advanced vendor tools against existing application stores, even across competing platforms? This is an important decision point that you will face early on. This approach can address BI concerns such as, “Tell me what is happening in my business now.” This strategy can be most effective with noncompatible vendor technologies, or when strong divisional boundaries exist between data stores. It is far more efficient (in terms of person-hour commitment from your team), and you will be able to deliver solutions quicker to the business units.

While this might result in longer wait time for users, using tools such as SSIS can help you build packages to retrieve aggregate and clean large quantities of data (even from non-Microsoft technology stores), build the dimensional relationships in memory cache, and present them to requesting applications via Microsoft SQL Server Analysis Services (SSAS). In the past, this could be practical only for relatively small quantities of data; however, database vendor optimizations are making this scenario a real choice.

Phase 2

Whenever, possible, build and keep the dimensional stores locally. With large volumes of aggregated data, companies can begin more intense mining to address BI concerns such as, “Show me historical performance.” To begin here, it is essential that you critically analyze how much aggregate data replication truly is needed to a centralized store. Can you more effectively populate local MOLAP stores to represent regional business-analysis needs? When you examine regional needs versus larger corporate needs, you will likely find that more of the deeper, drilldown BI reporting needs are local/ regional in nature. Enterprise-wide BI queries tend to be broader, trend based analysis that can be supported by summary aggregations from smaller dimensional stores. Cheaper, inexpensive MOLAP stores can be effectively maintained by local resources and greatly reduce the complexity of a central warehouse design, as well as mitigate potentially large network congestion and replication needs.

Consider beginning your efforts in a smaller, less-strategic division of the company or geography to test your team’s ability. This design approach is almost an inverse of the traditional DW and downstream data-mart approach: Instead, the smaller, regionalized MOLAP stores become aggregation feeds to a larger, further aggregated, summary DW store for broader trending analysis. Although business trends are pushing highly globalized patterns, the need for in-depth regional mining is increasing, too; and relying solely on a centralized DW pattern can require a massive investment in both physical and people resources to maintain and might prove overly costly, fraught with performance challenges, and overly cumbersome to enhance or change.

Phase 3

Build and populate a traditional, independent, centralized DW of your dreams to reach all of the more ambitious BI needs of your company. This approach will address the harder BI concerns such as the ever-elusive BI goldmine, “Predict future results,” which can be accomplished only by analysis of trends across often voluminous, company-wide historical data.

While historical trending and data mining can be performed across geographies (read, utilizing or aggregating further from Phase 2 [or even Phase 1] repositories), to get the raw reporting and drilldown-supported dashboard experience against very large, corporate-wide historical data, a centralized DW implementation most likely will be the most effective choice. However, many successful BI projects will likely find a blend between the Phase 2 and Phase 3 approaches.

 

Designing Effective, Performant, Maintainable Dimensional Storage

As data warehousing has evolved, what once was a static strategy of replicating large, read-only stores for reporting has become a far more dynamic environment in which users are given expansive powers such as building their own ad-hoc queries, self-service reporting (using tools such as PowerPivot, previously codenamed “Gemini” and an extension of Office Excel 2010 that will be available in the first half of 2010 and enables users to pull down massive dimensional data to the tune of 100 plus–million rows for real-time, cached pivoting), and even write-back capabilities directly into the dimensional stores.

The power of the tools that are available to you at design time can greatly affect the strength of your models, assist visually with overcoming the complexity of the relationships, and reveal potential bottlenecks, poor query structure, and ineffective mining semantics. Through the use of the SSAS designer within Business Intelligence Design Studio (BIDS), the architect is given a comprehensive set of tools for designing and optimizing dimensional stores and queries against those stores (see Figure 2).


Figure 2. SSAS Designer—Dimensional modeling in BIDS (Click on the picture for a larger image)

 

Listed here are a few key DW principals to remember when you are designing your dimensional models to maximize performance later (more comprehensive articles on this subject and advanced SSAS design can be found at http://technet.microsoft.com/en-us/magazine/2008.04.dwperformance.aspx and at http://www.ssas-info.com/analysis-services-papers/1216-sql-server-2008-white-paper-analysis-services-performance-guide):

  1. The overwhelming majority of MOLAP data will grow in your fact tables. Constrain the number of measures in your fact tables, as query processing is most effective against narrow-columned tables. Expand the depth of attributes in the supporting dimension tables. The benefit of breaking dimension tables into further subdimension tables, when possible (snowflake pattern), is hotly debated, although this approach generally gives more flexibility when one considers scale-out models and utilizing indexing and performance-enhancing technologies such as partitioning.
  2. Implement surrogate keys for maintaining the key relationship between fact and dimension tables, instead of enforcing foreign-key constraints, which can often manifest as compound keys that cover several columns. A surrogate key is an integer- typed identity column that serves as an artificial primary key of the dimension table. This approach can minimize storage requirements and save storage/processing overhead for maintaining indexes.
  3. While OLTP stores traditionally are highly normalized, this is far less important for dimensional stores. Denormalized data has certain advantages in extremely large stores—namely, the reduction of joins that are required in queries. In addition, database products such as SQL Server 2008 utilize a highly optimized bitmap filtering (also known as “bloom filter”) that eliminates largely redundant and irrelevant data from query processing during star joins.
  4. Prototype, prototype, prototype. Your physical design might work well for initial static-reporting needs; however, as users become accustomed to the new data that is available to them, and as their expertise grows, you will need to support ad-hoc querying in a substantial way. Ad-hoc queries have the capability to create explosive data sets, depending upon the structure of your historical modeling within your dimensions. Most database products support test/modeling partitioning. Spend ample time understanding the overall impact to your environment (including indexing/maintenance) when ad-hoc support is considered. Implementing self-service BI products such as PowerPivot, instead of open ad-hoc query support, can greatly ease demands on the server by delivering large chunks of dimensional data directly down to the client for ease of manipulation by the user for drilldown.
  5. Implement a clustered index for the most common fact table surrogate keys, and nonclustered indexes upon each of the remaining surrogate keys. As per the previous item #4, the addition of ad-hoc query support can greatly affect your indexing strategy; however, for overall common-usage patterns, this indexing strategy has proven efficient.
  6. Use partitioned-table parallelism. Because of the growth of fact tables over time, most DW architects implement a partition strategy (breaking up the fact table over physical storage devices). This is most commonly performed by using a date column, but it can be performed as a range that is based on usage patterns (supported by SQL Server 2008). SQL Server 2008 implements a new partitioned-table parallelism (PTP) feature that highly optimizes queries over partitioned tables by executing queries in parallel across all available multicore processors. For more detailed information on PTP and other new DW enhancements in SQL Server 2008, visit http://technet.microsoft.com/en-us/library/cc278097.aspx.
  7. Implement a compression strategy. Over time, data warehouses can become huge. Overhead for compression can often be overcome by the reduction in redundant data—thereby, reducing query processing time, as well as providing maintenance benefits for size of data storage. However, the general strategy remains to implement compression on the least-used data. Compression can be applied in many different ways, including at partition, page, row, and others. As per item #4, the most efficient pattern will require extensive prototyping in your model.

 

Consolidation, Aggregation, and Population of the Stores

Fortunately, when you have determined what data to access and designed your data-warehouse topology, aggregation and population is becoming a more simplified task, thanks to advancements in vendor tools. Using tools such as SSIS within BIDS (which ships with SQL Server Enterprise edition), an architect can build a “package” that can fetch, manipulate/clean, and publish data. SSIS comes with data connectors that enable the architect to design packages to access all the major database platforms (including Oracle, IBM, Teradata, and others) and a comprehensive set of data-manipulation routines that are conveniently arranged in a visual toolbox for easy drag-and-drop operation in a visual data-flow model (see Figure 3).


Figure 3. SSIS data-flow representation in BIDS (Click on the picture for a larger image)

 

SSIS allows for highly sophisticated packages, including the ability to loop through result sets; the ability to compare data from multiple sources; nested routines to ensure specific order or compensation/ retry rules when failure occurs during the data-collection exercise (for instance, the remote server in Central America is down for maintenance); and sophisticated data transformation.

While SSIS can be compared to most traditional extract, transform, and load (ETL)–type tools, it offers a far richer set of features to complement the dynamic MOLAP environment of a DW that is implemented with SQL Server 2008 and Analysis Services (for instance, sophisticated SSIS packages can be imbedded within the DW and called dynamically by stored procedures to perform routines that are dictated by conditional data that is passed from within the Analysis Services engine). Many Microsoft ISVs have even built sophisticated SSIS package–execution strategies that are being called from within Windows Workflow Foundation (WF) activities. This gives the added possibility of managing highly state-dependent (or long-running) types of data-aggregation scenarios.

For a more detailed overview of SSIS, visit http://download.microsoft.com/download/a/c/d/acd8e043-d69b-4f09-bc9e-4168b65aaa71/ssis2008Intro.doc.

 

Highly Distributed Data, Global Geography, Mobile-User Data Marts, Ad Hoc Query, and Data-Synchronization Considerations

Perhaps the most daunting challenge for enterprise architects who are designing a DW solution that supports BI applications is to understand the potential impact of a design upon the physical network assets. Will the environment be able to sustain replication of massive amounts of dimensional data on ever-increasing historical data?

A key to success is utilizing effective read-only snapshot replication of predetermined aggregated subsets. This means a careful analysis of the needs of each independent geography and the use of advanced features within database products, such as extensive data compression and “sparse” attributes on columns. SPARSE is a new column attribute that is available in SQL Server 2008 and removes any physical data size to null values. Because data warehouses are typically full of enormous amounts of null field values (not every shoe from every store in every region is purchased every day), the compression and removal of the physical size of null value fields is essential. Many traditional data- warehouse products do not have this capability, and SQL Server 2008 has many performance advantages for the replication of large volumes of dimensional data.

Another effective strategy is to grant write-back and ad-hoc query capabilities judiciously. These features by necessity create larger overhead in design to support downstream geographies and can greatly increase replication requirements and the possible reconstitution of the aggregated dimensional stores (a very expensive operation, as data volumes increase in size).

Fortunately, data replication has become a seamless art, thanks to improvements within database vendors. Management of replication partnerships, rollback on failures, and rules on data-consistency violations all can be handled effectively within the replication- management console. Together with significant performance enhancements of both the replication engine and physical data size, global enterprises can rely upon the SQL Server 2008 toolset to meet even the most demanding data-warehouse and downstream data- mart strategies.

But what about those power users who demand offline BI analysis in a mobile fashion? Using the Microsoft platform, you can deliver powerful BI even in a disconnected paradigm by utilizing SQL Server CE or Express Edition on the client. Microsoft has worked with dozens of Global ISVs that have designed application suites that utilize large dimensional cube data on the client in a disconnected mode. You can establish replication strategies within the client-side database for when the mobile user is connected, or application designers can implement Synchronization Services for ADO.NET and manage data replication that is specific to the workflow needs within the application. For more information, visit http://msdn.microsoft.com/en-us/sync/default.aspx.

 

Conclusion

Enabling BI solutions for your enterprise first requires a significant investment into an advanced understanding of the corporate data within your access. For querying extremely large data volumes that likely include historical references (data over time), dimensional storage models such as data-warehouse design patterns (including MOLAP and others) have proven the most efficient strategy. For more detailed guidance in implementing your DW and BI solutions by utilizing SQL Server 2008, visit http://www.microsoft.com/sqlserver/2008/en/us/white-papers.aspx. In addition, for real-world scale-out experiences, visit the best-practices work that has been done by the SQL CAT team at http://sqlcat.com/.

 

About the Author

Charles Fichter (cfichter@microsoft.com) is a Senior Solution Architect within the Developer Evangelism, Global ISV (Independent Software Vendor) team at Microsoft Corporation. A 20-year veteran software and database architect, Charles specializes in business intelligence, Microsoft SQL Server, and Microsoft BizTalk Server, as well as general .NET middle- and data-tier application and database design. For the past four and a half years, Charles has focused on assisting Global ISVs with their application-design strategies. Other recent publications by Charles include the following:


Follow up on this topic

Rate this content 

Data-Integration Strategy

Derek E. Wilson

 

Today, enterprises collect and store data easier than ever in a variety of applications. Many of these applications are inside the firewall, while some are in the cloud and others in remote locations. All too often, application data is not collected and used across an organization for consistent analysis and business decisions. The ability to collect and analyze this data can benefit your company, if it is treated as an organizational asset.

To create an enterprise business-intelligence (BI) architecture, you must identify the core subject areas of your businesses, such as the following:

  • Customer
  • Product
  • Employee
  • Inventory

When these have been identified, you must further define what attributes should be collected for an enterprise view. The fields that you define will be the backbone of your enterprise BI platform. For instance, if a customer relationship management (CRM) system will allow you to capture data that shows that a customer has three children, must this piece of information be migrated to the enterprise BI system, to let everyone in the organization leverage it to make better business decisions? Knowing what attributes you must store for the collective good of the organization will enable you to begin data integration.

By leveraging Microsoft SQL Server and Microsoft BizTalk Server, an enterprise BI and data-integration strategy can be developed to integrate and store critical subject areas. The database structure should be abstracted by using an enterprise integration pattern that is known as the canonical data model. This model requires all incoming data to meet a user-defined pattern. For instance, an enterprise BI system might require the following fields:

  • First Name, Last Name
  • Address
  • City, State, ZIP Code
  • Phone Number
  • E-mail

Source applications likely store other information, such as mobile number, gender, and age.

BizTalk Server can be leveraged to receive messages from various applications and write the information in the appropriate database tables.

When the data has been collected, it can be stored in an online analytical processing (OLAP) cube and then presented to business users for decision-making. The process of loading and adding calculations to a cube allows everyone in the business to leverage the work that is done to create value from the data. As users access the cube, they get consistent answers to queries; and, when new calculations are requested, everyone benefits from the additions.

By identifying, integrating, and creating central OLAP stores, an organization can leverage data as an asset across the company.


Derek E. Wilson is a BI Architect and Manager in Nashville, TN. Visit his Web site at www.derekewilson.com.