Quality Data Through Enterprise Information Architecture
by Semyon Axelrod
Summary: Data quality is a well-known problem and a very expensive one to fix. While it has been plaguing major U.S. corporations for quite some time, lately it is becoming increasingly painful. Higher prominence of various regulatory-compliance acts, such as SOX, GLB, Basel II, HIPAA, HOEPA, and others, necessitates an adequate response to the problem. For a significant class of data issues (semantics), it is not possible to solve the "data-quality problem" by just working with data. Unfortunately, this limitation is not commonly understood at all, but the results leave very little doubt about the ineffectiveness of the current (pure-data) approach. This discussion proposes how to deal with this problem through correct information architecture.
In the classes of applications that heavily depend on enterprise data quality—business intelligence, finance reporting, market-trend analysis, and so on—a typical approach to the data-quality problem usually starts and ends with the activities scoped to the physical data-storage layer (frequently relational databases).
Not surprisingly, according to the trade publications, most of these efforts have minimal success. Given that in business applications data always exists within the context of a business process, all the attempts to solve the "data-quality problem" at the purely physical data level (that is, databases and extract, transform, load [ETL] tools) are doomed to fail.
This failure is because the physical level does not capture the requisite semantics to accurately communicate data across business processes. As a result, most of the semantic data issues exist at the process and organizational boundaries. The top (or enterprise) level is the focal point with the highest probability for discrepancy. Most companies do not have domain (or ontological) models. If models exist at all, the majority of them exist at the project (sub-departmental) level as logical data models. Enterprise-level business and information models are practically absent, and therefore there is no way to effectively communicate the data across organizational, department, project, or service boundaries.
Successful business data management begins by taking focus away from data. The focus initially should be on creating an enterprise architecture (EA), especially its commonly missing business and information architecture constituents. Information architecture spans business and technology architectures, brings them together, keeps them together, and provides the necessary rich contextual environment to solve the ubiquitous data-quality problem. Thus, enterprise business, information, and technology architectures are needed for successful data management.
Once we have agreed on the "common" aspects of the data model, we can extend it locally to suit our needs. This extension is analogous to using a base class in programming. Similarly, having an enterprise-wide information architecture with rich semantic details lets us transform data at boundaries by mapping from one definition to another. These interfaces are analogous to the boundaries of services, with the definitions providing the contracts.
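The base-class analogy can be made concrete with a minimal Python sketch. The class names (`Customer`, `BillingCustomer`, `ShippingCustomer`), their fields, and the `to_shipping` mapping function are hypothetical illustrations chosen for this example; they are not part of any model described in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    """Enterprise-level definition: only the attributes every unit agrees on."""
    customer_id: str
    legal_name: str


@dataclass(frozen=True)
class BillingCustomer(Customer):
    """Local extension used by a (hypothetical) billing department."""
    billing_address: str


@dataclass(frozen=True)
class ShippingCustomer(Customer):
    """Local extension used by a (hypothetical) shipping department."""
    shipping_address: str


def to_shipping(billing: BillingCustomer, shipping_address: str) -> ShippingCustomer:
    """Map a record across the billing/shipping boundary.

    The common attributes carry over unchanged; the local attribute must be
    supplied by the receiving context because billing has no definition for it.
    """
    return ShippingCustomer(
        customer_id=billing.customer_id,
        legal_name=billing.legal_name,
        shipping_address=shipping_address,
    )


billed = BillingCustomer("C-001", "Acme Corp", "PO Box 12, Springfield")
shipped = to_shipping(billed, "500 Warehouse Rd, Springfield")
print(shipped)
```

The mapping function plays the role of the contract at the boundary: it states explicitly which attributes are shared and which are local to one side.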
Major business initiatives in a broad spectrum of industries, private and public sectors alike, have been delayed and even cancelled, citing poor data quality as the main reason. The problem of poor information quality has become so severe that it has moved to the top tier among the reasons for business customers' dissatisfaction with their IT counterparts.
While it is hard to argue with the observation that poor data quality is probably the most noticeable issue, in a vast majority of cases it will be accompanied by the equally poor quality of systems engineering in general, including requirements elicitation and management, application design, configuration management, change control, and overall project management. The popular belief that "[i]f we just get data under control, the rest (usability, scalability, maintainability, modifiability, and so forth) will follow" has proven to be consistently wrong. I have never seen a company where the data-quality level was significantly different from the overall IT environment quality level. If business applications in general are failing to meet (even) realistic expectations of the business users, data is just one of the reasons cited, albeit the most frequent one.
As a corollary, if business users are happy with the level of IT services, usually all the quality parameters of the IT organization effort are at a satisfactory level, including data. I challenge you to come up with an example where data quality in a company was significantly different from the rest of the information systems' quality parameters.
Therefore, what is commonly called the "poor data-quality problem" should be more appropriately called the "data-quality deficiency syndrome." It is indeed just a symptom of a larger and more complex phenomenon that we can call "poor quality of systems engineering in general." Data quality is just the most tangible and obvious symptom that our business partners observe, not the root cause. (See Resources for an article in The Architecture Journal 8 that provides an example of how an attempt to work with data architecture separately from all the other architectural issues leads to potentially significant problems.)
Since data plays such a prominent role in our discussion, let us first agree on what data is. Generally, most agree that data is a statement accepted at face value. In this article we are generally talking about the large class of data comprised of variable measurements, although the concepts apply to other types of data.
A notion that data is produced by measurements or observations is very significant. It points to a very important concept—a notion of data context or metadata—that is absolutely critical to the success of any data-quality improvement effort. In other words, a number just by itself, stripped of its context, is not really meaningful for business users. For example, you have a customer. For the billing department, that customer is defined partially by one address, and for shipping it is likely defined by another. Without the context, you are not sure which locally unique aspects of the data definition apply.
If we know the context in which the customer interfaced with the enterprise, we now have a much better understanding of the business circumstances surrounding this number. This notion is at the crux of the poor data-quality problem—lack of sufficient data context. We typically do not have enough supporting information to understand what a particular number (or a set of numbers) means, and we thus cannot make an accurate judgment about validity and applicability of the data.
IT consultant Ellen Friedman puts it this way: "Trying to understand the business domain by understanding individual data elements out of context is like trying to understand a community by reading the phone book."
The class of data that is the subject here always exists within a context of a business process. To solve the poor data-quality problem, data context should always be well defined, well understood, and well managed by data producers and consumers. For example, from credit approval, legal approval, servicing approval, appraisal approval, custodial review, and investor review perspectives we have multiple subprocesses labeled "approval." This context means there must be an agreement on the information architecture that gives the common aspects of the data and some way to negotiate and share local enhancements.
According to Joseph M. Juran, a well-known authority in the quality-control area and the author of the Pareto Principle (which is commonly referred to today as the "80–20 principle"), data are of high quality "if they are fit for their intended uses in operations, decision making, and planning. Alternatively, data are deemed of high quality if they correctly represent the real-world construct to which they refer" (see Resources). Again, this definition points to the notion that data quality is dependent on our ability to understand data correctly and use it appropriately.
As an example, consider U.S. postal address data. Postal addresses are one of the very few data areas that have well-defined and universally accepted standards. Even though an address can be validated against commercially available data banks to ensure its validity, this validation is not enough for every purpose. In our previous example, if a business uses a shipping address for billing and vice versa, or uses a borrower's correspondence address for appraisal, the results obviously will be wrong.
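The difference between a valid address and an address that is fit for a given use can be sketched in a few lines of Python. The `Address` record, its `purpose` and `is_deliverable` fields, and the `address_for` helper are assumptions made for illustration; the `is_deliverable` flag merely stands in for the outcome of some external postal validation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Address:
    text: str
    purpose: str          # business context, e.g. "billing" or "shipping"
    is_deliverable: bool  # assumed result of an external postal validation


def address_for(addresses, purpose):
    """Return the address recorded for the given purpose, or raise if none fits."""
    for address in addresses:
        if address.purpose == purpose:
            return address
    raise LookupError(f"no address recorded for purpose {purpose!r}")


customer_addresses = [
    Address("PO Box 12, Springfield", "billing", True),
    Address("500 Warehouse Rd, Springfield", "shipping", True),
]

print(address_for(customer_addresses, "shipping").text)
```

Both addresses pass validation, yet only the `purpose` tag tells a consumer which one applies. A request for a purpose that was never recorded (an appraisal address, say) fails loudly instead of silently returning a valid but wrong address.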
To further quantify the problem, Professor Richard Wang of MIT defines 15 dimensions or categories of data-quality problems. They are: accuracy, objectivity, believability, reputation, relevancy, value added, timeliness, completeness, amount of information, interpretability, ease of understanding, consistent representation, concise representation, access, and security. A serious discussion of this list would warrant a whole book. However, it is important to make a point that most of these attributes are in fact representing the notion of data context. For the purpose of our discussion on data quality, the most relevant attributes are: interpretability, ease of understanding, completeness, and timeliness.
The timeliness attribute, also known as the temporal aspect of information/data, is arguably the most intricate one from the data-quality perspective. There are at least two interpretations of data timeliness. The first concerns our ability to present required data to a data consumer on time. It is a derivative of good requirements and design, but it is of little interest to us in the context of this discussion. The second aspect is the notion of data having a distinctive "time/event stamp" related to the business process, and thus allowing us to interpret data in conjunction with the appropriate business events. It is not hard to see that more than half of the data-quality attributes in the previous list are at least associated with, if not derived from, this interpretation of timeliness.
While the temporal aspect of information quality is extremely important for understanding and communicating, it is often lost. For example, in the mortgage-backed securities arena, there are two very similar processes with almost identical associated data. The first is the asset accounting cycle, which starts at the end of the month for interest accrual due next period. The second is the cash-flow distribution cycle, which starts 15 days after the asset accounting cycle begins. This difference of 15 calendar days, during which many possible changes to the status of a financial asset can take place, can make financial outcomes differ significantly; but from the pure data modeling perspective the database models in both cases are very similar or even identical. A data modeler who is not intimately familiar with the nuances of a business process will not be able to discern the difference between the data associated with disparate processes by just analyzing the data in the database, which is why we need semantic definitions for reference.
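A small Python sketch shows how the same stored data can yield different answers depending on which process reads it. The `StatusChange` record, the `status_as_of` helper, and all dates and statuses below are invented for illustration; only the 15-day offset between the two cycles comes from the example above.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class StatusChange:
    """One dated change to an asset's status (hypothetical record shape)."""
    effective: date
    status: str


def status_as_of(history: list[StatusChange], as_of: date) -> str:
    """Return the latest status whose effective date is on or before as_of."""
    applicable = [c for c in history if c.effective <= as_of]
    if not applicable:
        raise ValueError(f"no status known as of {as_of}")
    return max(applicable, key=lambda c: c.effective).status


# Illustrative history for one loan; dates and statuses are invented.
history = [
    StatusChange(date(2006, 1, 1), "current"),
    StatusChange(date(2006, 2, 10), "paid off"),
]

accounting_start = date(2006, 1, 31)                        # end of month
distribution_start = accounting_start + timedelta(days=15)  # 15 days later

print("asset accounting:", status_as_of(history, accounting_start))
print("cash-flow distribution:", status_as_of(history, distribution_start))
```

The table holding `history` is identical for both processes. Only the as-of date, which belongs to the business process rather than to the data model, determines which status each process should see.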
Architecture is one of the most used (and abused) terms in the areas of software and systems engineering. To get a good feel for the complexity of the systems-architecture topic, it suffices to list some of the most commonly used architectural categories, methods, and models: enterprise; data; application; systems; infrastructure; Zachman; information; business; network; security; model-driven architecture (MDA); and, certainly the latest silver bullet, service-oriented architecture (SOA).
All of these architecture types naturally have a whole body of theoretical and practical knowledge associated with them. An in-depth discussion of all of the various architectural types and approaches is clearly outside the scope of this article. We will concentrate on the concept of enterprise architecture. The following definition by Philippe Kruchten provides a starting point for this discussion: "Architecture encompasses the set of decisions about the system structure." Eberhardt Rechtin clarifies: "A system is defined ... as a set of different elements so connected or related as to perform a unique function not performable by the elements alone" (see Resources).
To emphasize the practical side of architecture development, these two definitions can be further enriched. A long-time colleague of mine, Mike Regan, a systems architect with many successful system implementations under his belt, adds: "Architecture can be captured as a set of abstractions about the system that provides enough essential information to form the basis for communication, analysis, and decision making."
From these definitions, it is clear that system architecture is the fundamental organization of a system. System architecture contains definitions of the main system constituencies as well as the relationships among these constituencies. Naturally, the architecture of a complex system is very complex, too.
A rich contextual environment (metadata) needs to exist to solve the problem, and a comprehensive set of models is needed to produce this desired metadata. The enterprise architecture modeling effort can and should produce these models. There is an emerging industry consensus that enterprise architecture should include a business process model as well as an enterprise information model, but as stated previously, both (business and information) models are practically absent.
As previously discussed, conventional data modeling techniques do not contain a mechanism that can provide sufficiently rich metadata, which is absolutely necessary for any data-quality improvement effort to be successful. At the same time, this rich contextual model is a natural byproduct of a successful EA development process so long as this process adheres to a rigorous engineering approach.
To work with such architectural complexity, some decomposition method is needed. One such method is the three-layered model. All modern architectural approaches are centered on a concept of model layers. These are horizontally-oriented groups defined by a common relationship with other layers, usually their immediate neighbors above and below. A possible layering for EA can constitute a conceptual capabilities (or business process) layer at the top, the logical information technology specifications layer in the middle, and the physical information technology implementation layer on the bottom (see Resources). This model assumes an information systems-centered approach. In other words, the purpose of this architectural model is to provide an approach to successful information systems implementation. A simplified three-layered model is shown in Figure 1. Some key concepts are worth mentioning:
- First, although business strategy is not a constituent of the business architecture layer, it represents a set of guidelines for enterprise actions regarding markets, products, business partners, and clients. The business architecture captures a more elaborate view of these actions.
- Second, the model demonstrates that enterprise information models reside in both the conceptual and in the logical layers, and provide the foundation for consistent interaction between these layers.
- Third, the enterprise IT governance framework is defined in the top conceptual layer, while IT standards and guidelines that support the governance framework are implemented in the specification layer.
- Finally, the enterprise specification layer defines only the enterprise integration model for the departmental systems but not their internal architectures.
Figure 1. Enterprise architecture three-layered model
Let us expand on the notion of business process architecture and elaborate on the layering details.
The EA framework is developed along two dimensions: horizontal and vertical. Complete traceability along both dimensions is absolutely necessary: vertically (between the conceptual, logical, and physical layers) as well as horizontally within each layer but across the organizational boundaries.
Neither dimension of the modeling is new, but the two are not normally combined, and this separation leads to our semantic disconnect and thus to our poor data quality. I am advocating a "hybrid approach." It is quite common for data warehouse sponsors to advocate an enterprise data model. Operational departmental systems each have their own scope, but usually do not have any foundational business models and thus cannot possibly be mapped in a consistent manner to any common model, either horizontally or vertically.
As stated previously, all modern architectural approaches are centered on a concept of model layers. Let us expand on this notion specifically from the information quality viewpoint.
It is important to emphasize the business process layer as the foundation for our EA model. Carnegie Mellon University (CMU) Software Architecture Glossary provides this definition for EA: "A means for describing business structures and processes that connect business structures" (see Resources). Interestingly, this definition is applicable to and definitive of EA as a whole, and not just business EA.
The EA definition used by U.S. government agencies and departments emphasizes a strategic set of assets, which defines business, as well as the information necessary to operate the business, and the technologies necessary to support the business operations. (See Resources for a link to the original version of this definition.)
This definition maps extremely well to the proposed three-layered view of EA, but a word of caution is appropriate: While the three-layered model provides a good first approximation of EA, it is by no means complete and/or rigorous. It is obvious that both business and technical constituencies of the EA model can and should be in turn decomposed into multiple sublayers.
In the quest to make the business enterprise architecture layer a robust practical concept, it has morphed from the initial organizational-chart-centered (and thus brittle) view into a business process-centered orientation, and lately into a business-capabilities-centered view, becoming even more resilient to business changes and transformations.
Current consensus around EA accentuates both its business and technical constituencies. This business and IT partnership is highlighted even more by the advances of SOA, which views the business process model and supporting technology model as an assembly of interconnected services.
In the proposed three-layered view of the EA, the business process layer includes a business domain class model. Since this model is implemented at the highest possible level of abstraction, it captures only foundational business entities and their relationships. Thus, the top layer domain model is very stable and is not subject to change unless the most essential underlying business structures change. The information (or data) elements that are defined at this level of the domain model are cross-referenced against the business process model residing in the same top layer. In other words, every domain model element has at least one business process definition that references it. The reverse is also true: There is no information element called out in the business processes definitions that does not exist in the domain model.
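The two-way cross-reference rule described above lends itself to a mechanical check. The following Python sketch assumes, purely for illustration, that the domain model is available as a set of element names and the process model as a mapping from process names to the elements they reference; the function name and the sample data are hypothetical.

```python
def cross_reference_gaps(domain_elements, process_references):
    """Compare domain-model elements with those referenced by process definitions.

    domain_elements: set of element names defined in the domain model.
    process_references: dict mapping a process name to the element names it uses.
    Returns (unreferenced, undefined):
      unreferenced - domain elements that no process mentions
      undefined    - elements a process mentions that the domain model lacks
    """
    referenced = set()
    for elements in process_references.values():
        referenced.update(elements)
    unreferenced = set(domain_elements) - referenced
    undefined = referenced - set(domain_elements)
    return unreferenced, undefined


# Hypothetical top-layer models.
domain = {"Borrower", "Loan", "Property", "Investor"}
processes = {
    "Credit approval": {"Borrower", "Loan"},
    "Appraisal approval": {"Property", "Appraiser"},
}

unreferenced, undefined = cross_reference_gaps(domain, processes)
print("Defined but never referenced:", sorted(unreferenced))
print("Referenced but never defined:", sorted(undefined))
```

Both result sets should be empty for a top layer that satisfies the rule; anything they contain points to a place where the domain model and the process model have drifted apart.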
It is also worth pointing out that only common, enterprise-level information processes and elements are captured at the top layer of EA. For example, suppose that for historical reasons a particular enterprise consists of multiple lines of business (LOB), each carrying out its own unique business process with related information definitions. At the same time, all the LOBs participate in the common enterprise process. In this case, at the top (enterprise-level) business-process layer, only the common enterprise-level process will be modeled. In extreme cases, this enterprise process will consist primarily of the interfaces between the LOB-level business processes.
Each of the enterprise's LOBs will need to have its own three-layered model, where top-level business entities and the corresponding information (or data) elements will be mapped unambiguously to the enterprise-level model entities. Needless to say, only the elements that have their counterparts at the enterprise level can possibly be mapped. By relating LOB-level definitions to the common enterprise-level equivalents, we are eliminating semantic mismatch (ambiguity) between different business units. And since our data elements are cross-referenced with the business process models, we should have enough contextual information to correlate information elements at the enterprise and LOB levels. In the most difficult cases, UML state transition diagrams should be created to capture temporal and event aspects of the business processes.
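As a rough sketch of such a mapping, the Python fragment below records, for each LOB, which enterprise-level element every local element corresponds to. All LOB names, element names, and helper functions are hypothetical; a local element mapped to `None` represents one that has no enterprise-level counterpart and therefore cannot be mapped.

```python
# Enterprise-level element names (hypothetical).
ENTERPRISE_ELEMENTS = {"Customer", "Loan", "ApprovalDecision"}

# Each LOB maps its local element names to an enterprise element, or to None
# when the element is purely local and has no enterprise counterpart.
LOB_MAPPINGS = {
    "Retail lending": {"Borrower": "Customer", "Mortgage": "Loan", "BranchCode": None},
    "Servicing": {"Payer": "Customer", "ServicedAsset": "Loan"},
}


def validate_mappings(mappings, enterprise_elements):
    """Return a list of (lob, local, target) entries whose target is unknown."""
    problems = []
    for lob, mapping in mappings.items():
        for local, target in mapping.items():
            if target is not None and target not in enterprise_elements:
                problems.append((lob, local, target))
    return problems


def local_names_for(enterprise_element, mappings):
    """List every LOB-level name that denotes the given enterprise element."""
    return {
        lob: sorted(local for local, target in mapping.items()
                    if target == enterprise_element)
        for lob, mapping in mappings.items()
    }


print(validate_mappings(LOB_MAPPINGS, ENTERPRISE_ELEMENTS))
print(local_names_for("Customer", LOB_MAPPINGS))
```

With the mapping made explicit, "Borrower" in one unit and "Payer" in another are visibly the same enterprise concept, which is the kind of semantic mismatch the enterprise-level model is meant to remove.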
The specification layer of the three-layered EA model introduces system-related considerations and defines specifications for the enterprise-level information systems. These are the systems that need to be constructed to support the business processes defined at the top layer of the model. By defining system requirements in terms of the business processes, another major cause of low data quality is eliminated: the disconnect between the business and the technology views of the enterprise system.
For example, it is quite common for more than one system to be operating on data elements from the objects defined at the top business layer. In this case, using my method, each system specification will define its own unique data attribute, but all these attributes are in turn mapped to the one element at the top layer. This top-down decomposition, with local augmentation, helps to alleviate a problem known as the "departmental information silo."
Again, similar to the top layer, in the spirit of correlating data with the process contextual information, business use-case realizations and system use cases (or similar artifacts) are introduced at this level to provide enough grounding for the data definitions. It is important to note that in addition to the enterprise systems, the system interfaces of LOB-level systems (to support business-process connection points among the different LOBs) are also specified in this layer.
In the physical implementation layer, the platform-specific implementations are defined and implemented. Multiple platform-specific implementations may be mapped to the same element defined at the specification layer. This unambiguous, contextually-based mapping from possibly multiple technology-specific implementations to a data element defined at the technology-independent specification level is the foundation for a robust, high-quality data-management approach.
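The layer-to-layer mappings can be pictured as two lookup tables that together give vertical traceability. This Python sketch is only an illustration under assumed names: the column identifiers, system attribute names, and the `trace_up` and `trace_down` helpers are all hypothetical.

```python
# Physical layer -> specification layer (hypothetical names throughout).
PHYSICAL_TO_SPEC = {
    "oracle.LOANS.CUST_NM": "LoanOrigination.borrowerName",
    "sqlserver.dbo.Payer.FullName": "Servicing.payerName",
    "mainframe.CUSTFILE.NAME": "LoanOrigination.borrowerName",
}

# Specification layer -> business (conceptual) layer.
SPEC_TO_BUSINESS = {
    "LoanOrigination.borrowerName": "Customer.legalName",
    "Servicing.payerName": "Customer.legalName",
}


def trace_up(physical_item):
    """Follow a physical item up to its specification and business elements."""
    spec = PHYSICAL_TO_SPEC[physical_item]
    return spec, SPEC_TO_BUSINESS[spec]


def trace_down(business_element):
    """Find every physical item that ultimately implements a business element."""
    return sorted(
        item for item, spec in PHYSICAL_TO_SPEC.items()
        if SPEC_TO_BUSINESS.get(spec) == business_element
    )


print(trace_up("mainframe.CUSTFILE.NAME"))
print(trace_down("Customer.legalName"))
```

Two physical columns map to one specification attribute, and two specification attributes map to one business element, so any stored value can be traced to the business definition that gives it meaning, and any business element to every place it is stored.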
Solving the data-quality problem is really a matter of solving the information architecture problem. By having enterprise-level definitions of our common data attributes and incorporating ways to communicate local extensions, we can accommodate all required mappings in the domain and support all of our service-level agreements.
It is impossible to overestimate the importance of the two-dimensional traceability in the discussed architectural model. The first dimension—vertical traceability between the model layers—provides a foundation for rich contextual connection between the business process and the system implementation that supports this process. The second dimension—horizontal traceability within the same model layer—provides a foundation for a rich contextual connection between the hierarchical organizational units, as well as the systems implemented at their respective levels.
A robust traceability mechanism is absolutely necessary for high-quality data to become a reality. The architectural model provides a foundation for the information traceability and thus data quality, without which it is not possible to address a cluster of issues introduced by the modern business environment in general and especially by the legal and regulatory-compliance concerns.
Analysis Patterns: Reusable Object Models (Object-Oriented Software Engineering Series), Martin Fowler (Addison-Wesley Professional 1996)
"Data Replication as an Enterprise SOA Antipattern," Tom Fuller and Shawn Morgan, The Architecture Journal 8 (Microsoft Corporation 2006)
Juran's Quality Handbook, Fifth Edition, Joseph M. Juran and A. Blanton Godfrey (McGraw-Hill 1999)
"An Ontology of Architectural Design Decisions in Software-Intensive Systems," Philippe Kruchten (University of British Columbia 2004)
Systems Architecting: Creating & Building Complex Systems, Eberhardt Rechtin (Prentice-Hall 1990)
Semyon Axelrod's information-technology career spans three continents and more than 25 years of experience in various areas of software engineering, management, and information systems. He currently specializes in enterprise architecture, IT governance and leadership, strategic business and IT alignment, and legal and regulatory-compliance issues. His articles can be found in DM Review and Computer World magazines. Contact Semyon at email@example.com or his blog (adsys.blogspot.com).
This article was published in the Architecture Journal, a print and online publication produced by Microsoft. For more articles from this publication, please visit the Architecture Journal website.