Thinking Global BI: Data-Warehouse Principles for Supporting
Enterprise-Enabled Business-Intelligence Applications
Summary: This article focuses on the design principles that support a global data-warehouse (DW) architecture, the cornerstone of any successful business-intelligence (BI) application story. It draws from Microsoft's Global independent software-vendor (ISV) partner experience in designing enterprise BI applications on Microsoft platform technologies, and it contains external links and references to public content that delves deeper into the design topics covered.
This article assumes that the reader has a basic understanding of DW concepts: the dimensional store, the underlying fact tables whose columns are known as measures, the dimension tables whose columns are known as attributes, and how schemas take on star and snowflake patterns. Many resources provide this overview; if needed, a concise one can be found here: http://www.simple-talk.com/sql/learn-sql-server/sql-server-data-warehouse-cribsheet . This article also covers successful DW project strategies and advanced topics in designing for performance.
Architects who are looking to deliver enterprise BI solutions are often enticed by packaged software applications that can fulfill executive requests for effective reports that reveal deep analysis of business performance. Microsoft and its vast array of ISV partners have made significant inroads into fulfilling a vision for easing the burden of generating the BI dashboards that all executives dream of: up-to-the-minute results of their business strategies and the ability to drill down into specific areas. Too often left unsaid, however, is the other 90 percent of the effort that falls to the supporting architect and IT force behind the glamorous UI: how to get the data; scrub and aggregate it effectively; and design appropriate, manageable, and performant dimensional stores, including ad-hoc query support, remote-geography replication, and even data marts that support mobile decision makers with ever-increasing volumes of dimensional data.
Understanding Isolated Enterprise Data, and Accessing It
Enterprise architects who are looking to aggregate application data stores into meaningful Multidimensional Online Analytical Processing (MOLAP) dimensional models often face many internal obstacles to accessing source data. These obstacles are frequently less technical than organizational: business, legal, audit, or security sensitivities; overly restrictive process overhead; or even politics, as business data can represent "turf" among executives and divisions. Some obstacles are technology constraints, such as incompatible or proprietary solutions, legacy file formats, and nonrelational or unstructured data. But as vendor tools (especially enhancements in Microsoft SQL Server 2008, particularly the Microsoft SQL Server Integration Services [SSIS] capabilities) and service-oriented architecture (SOA) technologies advance (for example, adoption of WS-* and other open connectivity standards), this is becoming far less of an issue.
However, many BI projects stall or are eventually killed because the team failed to understand accurately what data was required, how to access it, and how to make it usable. Usability is a key concept: How do you take a dozen columns (with names such as "xtssalescongproc") and consolidate them in a central fact table that has readable column names, so that end users can leverage self-service BI technologies in the future?
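As a minimal sketch of one common tactic (all table and column names here are hypothetical, and the meaning assigned to each cryptic column is assumed), a view over the source extract can map legacy names to readable ones before they feed a central fact table:

    -- Hypothetical source extract with cryptic legacy column names.
    -- A view gives downstream ETL and self-service users readable names.
    CREATE VIEW dbo.vSourceSales
    AS
    SELECT
        xtssalescongproc AS SalesConsignmentProceeds, -- assumed meaning
        xtssalesdt       AS SaleDate,
        xtssalesregcd    AS RegionCode
    FROM dbo.xts_sales_extract;

Downstream SSIS packages and ad-hoc users then work only against the readable names, so the cryptic source schema can change without disturbing the fact table.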
The following are a few general tips to help avoid the pitfalls of navigating access to isolated data:
Consider Approaching Data Warehousing in Phases
Many BI and DW projects begin with ambitions to design and build the world's greatest aggregated dimensional store, with the intention of replicating subsets of the dimensional stores to geographies (the notion of data marts). However, is this approach always necessary, or even feasible? Nearly every DW and BI project underestimates the time and resources that it takes to design aggregations and to scrub, build, and replicate data to independent MOLAP stores in support of global enterprise needs.
The maturity and skill of each enterprise to deliver BI solutions can vary greatly, and breaking up the effort into smaller, agile-based deliveries can help your teams gain the experience and expertise that are needed to understand how to deliver against the larger, longer-term objectives. Remember the 80/20 rule: Most of the time, you can deliver 80 percent of the required features from 20 percent of the effort. Consider a simplified approach in phases, as shown in Figure 1.
Figure 1. Proposed global BI/DW phases
If BI reporting needs are infrequent enough, leave as much source data in place as you can, and build multidimensional views through a distributed caching strategy to deliver a subset of business solutions quickly. Can you simply query efficiently, using advanced vendor tools, against existing application stores, even across competing platforms? This is an important decision point that you will face early on. This approach can address BI concerns such as, "Tell me what is happening in my business now." The strategy can be most effective when vendor technologies are incompatible or when strong divisional boundaries exist between data stores. It is far more efficient (in terms of person-hour commitment from your team), and you will be able to deliver solutions to the business units more quickly.
While this might result in longer wait times for users, tools such as SSIS can help you build packages to retrieve, aggregate, and clean large quantities of data (even from non-Microsoft technology stores), build the dimensional relationships in a memory cache, and present them to requesting applications via Microsoft SQL Server Analysis Services (SSAS). In the past, this was practical only for relatively small quantities of data; database-vendor optimizations, however, are making this scenario a real choice.
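For example (the linked-server and table names below are hypothetical, and assume a linked server has already been configured), SQL Server lets you query a non-Microsoft application store in place and shape the result locally, without first replicating the data:

    -- Query an Oracle application store in place via a linked server
    -- (ORA_ERP is a hypothetical linked-server name), then aggregate
    -- locally for a quick multidimensional view.
    SELECT r.RegionName,
           SUM(o.sales_amount) AS TotalSales
    FROM OPENQUERY(ORA_ERP,
         'SELECT region_id, sales_amount FROM erp.order_lines') AS o
    JOIN dbo.DimRegion AS r
        ON r.RegionSourceId = o.region_id
    GROUP BY r.RegionName;

A sketch like this is a quick way to test whether "query in place" is fast enough for a given report before you commit to building and replicating a dedicated dimensional store.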
Whenever possible, build and keep the dimensional stores locally. With large volumes of aggregated data, companies can begin more intense mining to address BI concerns such as, "Show me historical performance." To begin here, it is essential that you critically analyze how much aggregate-data replication to a centralized store is truly needed. Can you more effectively populate local MOLAP stores that represent regional business-analysis needs? When you examine regional needs versus larger corporate needs, you will likely find that more of the deeper, drill-down BI reporting needs are local or regional in nature. Enterprise-wide BI queries tend to be broader, trend-based analyses that can be supported by summary aggregations from the smaller dimensional stores. Smaller, inexpensive MOLAP stores can be maintained effectively by local resources, greatly reduce the complexity of a central warehouse design, and mitigate potentially large network-congestion and replication needs.
Consider beginning your efforts in a smaller, less-strategic division or geography of the company to test your team's ability. This design approach is almost an inverse of the traditional DW and downstream data-mart approach: Instead, the smaller, regionalized MOLAP stores become aggregation feeds to a larger, further-aggregated summary DW store for broader trending analysis, as sketched below. Although business trends are pushing highly globalized patterns, the need for in-depth regional mining is increasing, too; relying solely on a centralized DW pattern can require a massive investment in both physical and people resources, and it might prove overly costly, fraught with performance challenges, and cumbersome to enhance or change.
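A minimal sketch of that inverse pattern (all names are hypothetical; the summary database might live on the same instance or be shipped via replication): each regional store publishes a compact daily summary that the central DW consumes for trend analysis, rather than replicating row-level facts:

    -- Roll a regional fact table up into a compact daily aggregate
    -- destined for the central summary DW (names are hypothetical).
    INSERT INTO CentralSummaryDW.dbo.FactSalesSummary
        (RegionKey, DateKey, ProductCategoryKey, TotalSales, UnitsSold)
    SELECT
        f.RegionKey,
        f.DateKey,
        p.ProductCategoryKey,
        SUM(f.SalesAmount),
        SUM(f.Quantity)
    FROM dbo.FactSales AS f
    JOIN dbo.DimProduct AS p ON p.ProductKey = f.ProductKey
    WHERE f.DateKey = 20091031   -- yesterday's slice, for example
    GROUP BY f.RegionKey, f.DateKey, p.ProductCategoryKey;

Because only the grouped rows travel to the center, the network and storage cost grows with the number of categories and regions, not with raw transaction volume.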
Build and populate the traditional, independent, centralized DW of your dreams to reach all of the more ambitious BI needs of your company. This approach addresses the harder BI concerns, such as the ever-elusive BI goldmine, "Predict future results," which can be accomplished only by analyzing trends across often-voluminous, company-wide historical data.
While historical trending and data mining can be performed across geographies (that is, by further utilizing or aggregating from Phase 2 [or even Phase 1] repositories), a centralized DW implementation will most likely be the most effective choice for raw reporting and a drill-down-supported dashboard experience against very large, corporate-wide historical data. In practice, many successful BI projects will find a blend between the Phase 2 and Phase 3 approaches.
Designing Effective, Performant, Maintainable Dimensional Storage
As data warehousing has evolved, what once was a static strategy of replicating large, read-only stores for reporting has become a far more dynamic environment, in which users are given expansive powers: building their own ad-hoc queries, self-service reporting, and even write-back capabilities directly into the dimensional stores. (PowerPivot, previously code-named "Gemini," is an example: an extension of Office Excel 2010, available in the first half of 2010, that enables users to pull down massive dimensional data, to the tune of 100+ million rows, for real-time, cached pivoting.)
The power of the tools that are available to you at design time can greatly affect the strength of your models, help you visually overcome the complexity of the relationships, and reveal potential bottlenecks, poor query structure, and ineffective mining semantics. Through the SSAS designer within Business Intelligence Development Studio (BIDS), the architect is given a comprehensive set of tools for designing and optimizing dimensional stores and the queries against them (see Figure 2).
Figure 2. SSAS Designer: Dimensional modeling in BIDS
Keep a few key DW principles in mind when you design your dimensional models, so as to maximize performance later. More comprehensive articles on this subject and on advanced SSAS design can be found at http://technet.microsoft.com/en-us/magazine/2008.04.dwperformance.aspx and at http://www.ssas-info.com/analysis-services-papers/1216-sql-server-2008-white-paper-analysis-services-performance-guide.
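As one concrete illustration of widely cited dimensional-design principles (the schema below is a hypothetical sketch, not a prescription): keep fact tables narrow, key them on small integer surrogate keys, and push descriptive attributes out into the dimension tables:

    -- Star-schema sketch: a narrow fact table with integer surrogate
    -- keys; descriptive attributes live in the dimension tables.
    CREATE TABLE dbo.DimProduct (
        ProductKey  INT IDENTITY(1,1) PRIMARY KEY,
        ProductName NVARCHAR(100) NOT NULL,
        Category    NVARCHAR(50)  NOT NULL
    );

    CREATE TABLE dbo.FactSales (
        DateKey     INT NOT NULL,   -- e.g., 20091031
        ProductKey  INT NOT NULL REFERENCES dbo.DimProduct (ProductKey),
        StoreKey    INT NOT NULL,
        SalesAmount MONEY NOT NULL, -- measure
        Quantity    INT   NOT NULL  -- measure
    );

Narrow, integer-keyed fact rows keep the largest table in the warehouse small on disk, which pays off in scan speed, aggregation processing, and replication volume alike.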
Consolidation, Aggregation, and Population of the Stores
Fortunately, when you have determined what data to access and have designed your data-warehouse topology, aggregation and population become a more straightforward task, thanks to advancements in vendor tools. Using tools such as SSIS within BIDS (which ships with SQL Server Enterprise edition), an architect can build a "package" that can fetch, manipulate/clean, and publish data. SSIS comes with data connectors that enable the architect to design packages against all the major database platforms (including Oracle, IBM, Teradata, and others) and a comprehensive set of data-manipulation routines that are conveniently arranged in a visual toolbox for easy drag-and-drop operation in a visual data-flow model (see Figure 3).
Figure 3. SSIS data-flow representation in BIDS
SSIS allows for highly sophisticated packages, including the ability to loop through result sets; to compare data from multiple sources; to nest routines that enforce a specific order, or compensation/retry rules, when failure occurs during the data-collection exercise (for instance, the remote server in Central America is down for maintenance); and to perform sophisticated data transformations.
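The final publish step of such a package often lands in T-SQL. A minimal sketch (staging and dimension table names are hypothetical) that upserts cleansed staging rows into a dimension using the MERGE statement introduced in SQL Server 2008:

    -- Publish cleansed staging rows into a dimension table, inserting
    -- new members and updating changed ones in a single statement.
    MERGE dbo.DimCustomer AS tgt
    USING dbo.StagingCustomer AS src
        ON tgt.CustomerAltKey = src.CustomerAltKey
    WHEN MATCHED THEN
        UPDATE SET tgt.CustomerName = src.CustomerName,
                   tgt.City         = src.City
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerAltKey, CustomerName, City)
        VALUES (src.CustomerAltKey, src.CustomerName, src.City);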
While SSIS can be compared to most traditional extract, transform, and load (ETL) tools, it offers a far richer set of features to complement the dynamic MOLAP environment of a DW that is implemented with SQL Server 2008 and Analysis Services. For instance, sophisticated SSIS packages can be embedded within the DW and called dynamically by stored procedures to perform routines that are dictated by conditional data passed from within the Analysis Services engine. Many Microsoft ISVs have even built sophisticated SSIS package–execution strategies that are called from within Windows Workflow Foundation (WF) activities, which adds the possibility of managing highly state-dependent (or long-running) data-aggregation scenarios.
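One common way to wire this up (a sketch under assumed names; the job, procedure, and threshold below are hypothetical) is to wrap the package in a SQL Server Agent job and start it conditionally from a stored procedure:

    -- Start an Agent job that runs an SSIS aggregation package,
    -- but only when the conditional data warrants a refresh.
    CREATE PROCEDURE dbo.RefreshRegionalAggregates
        @RowsChanged INT
    AS
    BEGIN
        IF @RowsChanged > 10000   -- hypothetical refresh threshold
            EXEC msdb.dbo.sp_start_job @job_name = N'Load Regional Cubes';
    END;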
For a more detailed overview of SSIS, visit http://download.microsoft.com/download/a/c/d/acd8e043-d69b-4f09-bc9e-4168b65aaa71/ssis2008Intro.doc.
Highly Distributed Data, Global Geography, Mobile-User Data Marts, Ad Hoc Query, and Data-Synchronization Considerations
Perhaps the most daunting challenge for enterprise architects who are designing a DW solution to support BI applications is understanding the potential impact of a design on the physical network assets. Will the environment be able to sustain replication of massive amounts of dimensional data as the historical data keeps growing?
A key to success is effective read-only snapshot replication of predetermined aggregated subsets. This means careful analysis of the needs of each independent geography and the use of advanced features within database products, such as extensive data compression and "sparse" columns. SPARSE is a new column attribute in SQL Server 2008 that eliminates the physical storage of null values. Because data warehouses typically hold enormous numbers of null field values (not every shoe from every store in every region is purchased every day), compressing away the physical size of null-value fields is essential. Many traditional data-warehouse products lack this capability, and SQL Server 2008 has many performance advantages for replicating large volumes of dimensional data.
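A minimal sketch of both features (table names are hypothetical): declaring rarely populated measure columns SPARSE costs one keyword, and page compression is a one-line rebuild:

    -- SPARSE removes physical storage for NULLs in rarely populated
    -- columns; SPARSE columns must be nullable.
    CREATE TABLE dbo.FactDailyShoeSales (
        DateKey    INT NOT NULL,
        StoreKey   INT NOT NULL,
        ProductKey INT NOT NULL,
        UnitsSold  INT SPARSE NULL,  -- NULL on most store/product/day rows
        PromoUnits INT SPARSE NULL
    );

    -- Page compression shrinks tables further; note that in
    -- SQL Server 2008 a table with SPARSE columns cannot also be
    -- compressed, so apply it to tables like this hypothetical one:
    ALTER TABLE dbo.FactSalesHistory
        REBUILD WITH (DATA_COMPRESSION = PAGE);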
Another effective strategy is to grant write-back and ad-hoc query capabilities judiciously. These features by necessity create larger design overhead to support downstream geographies, and they can greatly increase replication requirements and force reconstitution of the aggregated dimensional stores (a very expensive operation as data volumes increase).
Fortunately, data replication has become a far more seamless art, thanks to database-vendor improvements. Management of replication partnerships, rollback on failures, and rules on data-consistency violations can all be handled effectively within the replication-management console. Together with significant performance enhancements in both the replication engine and physical data size, global enterprises can rely on the SQL Server 2008 toolset to meet even the most demanding data-warehouse and downstream data-mart strategies.
But what about those power users who demand offline BI analysis in a mobile fashion? Using the Microsoft platform, you can deliver powerful BI even in a disconnected paradigm by utilizing SQL Server Compact Edition or SQL Server Express edition on the client. Microsoft has worked with dozens of Global ISVs whose application suites utilize large dimensional-cube data on the client in a disconnected mode. You can establish replication strategies within the client-side database for when the mobile user is connected, or application designers can implement Synchronization Services for ADO.NET and manage data replication that is specific to the workflow needs of the application. For more information, visit http://msdn.microsoft.com/en-us/sync/default.aspx.
Enabling BI solutions for your enterprise first requires a significant investment in an advanced understanding of the corporate data within your reach. For querying extremely large data volumes that include historical references (data over time), dimensional storage models such as data-warehouse design patterns (including MOLAP and others) have proven the most efficient strategy. For more detailed guidance on implementing your DW and BI solutions with SQL Server 2008, visit http://www.microsoft.com/sqlserver/2008/en/us/white-papers.aspx. In addition, for real-world scale-out experiences, see the best-practices work done by the SQL CAT team at http://sqlcat.com/.
About the Author
Charles Fichter (email@example.com) is a Senior Solution Architect on the Developer Evangelism, Global ISV (Independent Software Vendor) team at Microsoft Corporation. A 20-year veteran software and database architect, Charles specializes in business intelligence, Microsoft SQL Server, and Microsoft BizTalk Server, as well as general .NET middle- and data-tier application and database design. For the past four and a half years, Charles has focused on assisting Global ISVs with their application-design strategies.