Patterns for High-Integrity Data Consumption and Composition
by Dion Hinchcliffe
Summary: The challenge is to consume and manage data from multiple underlying sources in different formats, while participating in a federated information ecosystem, and while maintaining integrity and loose coupling with good performance. Here are some emerging patterns for the growing world of mashups and composition applications.
These days, the field of software development is just as much about assembly and composition of preexisting services and data as it is about creating brand-new functionality. Mashups on the Web and composite applications in the world of service-oriented architecture (SOA) demand that data, in very different forms and from virtually any source, be pushed together, interact cleanly, and then go their separate ways. These approaches also demand that this interaction happen efficiently, all without losing fidelity or integrity.
Web services, SOA, and especially the Web itself are built on open standards that make it possible for systems to speak to one another, the ubiquitous and essential HTTP being a prime example. However, a shared communication protocol does not let you assume what sort of data your software will have to consume in the applications of the future. While it is still a good bet that you will be facing some form of XML, it is now just as likely that you will be faced with one of the lightweight data formats growing in popularity, such as JSON. Increasingly, it could just as easily be simple, plain text; a tree of native objects; or even a deeply layered, WS-* style Web services stack such as the one provided by the forthcoming Windows Communication Foundation (WCF).
Developers are therefore faced with serious challenges in terms of creating good data modeling, architecture, and integration techniques that can encompass this diversity. The heterogeneity of data representation forms can be quite daunting in environments with high levels of system integration. Never mind that in very loosely coupled, highly federated systems, as many applications are increasingly becoming, the likelihood of frequent change is high. Consequently, it is up to the application development community to create a body of knowledge on best practices for the low-barrier consumption of data from many different sources and formats, with matching techniques for integrating, merging, and relating them. The community also needs to ensure that integrity is maintained while remaining highly resilient to change and even the inevitable evolution of the underlying data formats.
While the patterns presented here are not authoritative and are intended primarily as guidance, they are based on the well-accepted concept of the interface contract. Interface contracts are now common in the Web services world, with WSDL, as well as in many programming languages, particularly those that support design by contract. In an interface contract, a client and a supplier come together and agree on a set of interactions, usually described in terms of methods or services. These interactions will cause data of interest to pass back and forth across the interface boundary.
The interactions, including their locations and protocols, are precisely defined in the contract, which is preferably machine-readable. Of particular interest are the data structures that are passed back and forth between supplier and client as parameters during each interaction. These data structures are really the center of the mashup/composite application model, because it is here, where highly diverse data structures meet, that we need to bring the principles of data integration and recombination into the sharpest relief.
Along these lines, then, we can get a general sense of the forces that are shaping data consumption and composition in modern distributed software systems. In rough order, these are the following.
Figure 1. Schemas embedded in an interface contract are usually the largest client dependency.
The interface contract as driver of data consumption and composition. When using data structures from any external source or service, such as a Web service, data structures must be validated against the interface contract provided by the source service. This validation can often be done at design time for data sources that are known to be highly stable and reliable. But in federated systems, especially those not under the direct control of the consumer, careful consideration must be given to doing contract checking at run time. Run-time contract checking in loosely coupled systems is an emerging technique, and some interesting options are at the disposal of the service consumer to avoid problems up front. The bottom line is that the interface contract is the principal artifact that drives data consumption and composition.
Abstraction impedance disrupts data consumption and composition. The physical point at which data integration occurs is increasingly outside the classic control of databases and of data-access libraries that convert everything to a unified data abstraction. Data retrieved from multiple external services and in multiple forms can and do collide within the browser, inside front-end clients, or in server-side code outside the database. The data structures from these underlying services, depending on the software application, range from native objects, XML, JSON, and SOAP documents or fragments to text, binary, and multimedia data. Though abstraction impedance has been a well-known problem in software for a long time, it has been exacerbated by the growing number of available models for representing information. When consuming and compositing these data structures into usable aggregate representations and then sending changes back out to their source services, a number of major problems appear.
- Data format variability. The fundamental formats of the underlying data structures are often highly incompatible, and good local facilities for processing all the forms that might be encountered are not usually at hand. Particularly problematic are extremely complex formats that require sophisticated protocol stacks to handle, such as higher-order Web services with WS-Policy requirements.
- Data identity. Keys and unique object/data identifiers often are not in common formats, nor are facilities available for checking and enforcing them consistently across abstractions.
- Data relationships. Establishing relationships and associations among data structures of highly differing formats can be problematic, both for efficiency and performance, for a number of reasons. Converting all data to a common format can be expensive, both in terms of maintenance and run-time cost. It also introduces the problem of maintaining the provenance of blended data and extracting it back out as it's modified. Keeping data structures in native forms assumes good mechanisms for compositing, relating, and manipulating them.
- Data provenance. Maintaining the originating source service of a chunk of data is essential for maintaining valid context for ongoing conversations with the source and for changes to the data in particular, such as updates, additions, and deletions of the data.
- Data integrity. When modifying source data, it is important to ensure that changes are valid according to the underlying service's interface contract and do not violate schema definitions. With well-defined XML documents that are supported by rich XML schemas, these validations are less of an issue. However, lighter-weight data formats such as JSON often do not have any machine-readable schemas. These types of data formats, though often only available from a federated service, are at the highest risk of integrity problems, since the rules for changing and otherwise manipulating them are not as well defined as they are for the other formats.
- Data conversion. Often, there are no well-specified conversions between data formats, or, even worse, multiple choices present themselves that alter the data in subtle ways. The most frequently affected data are numeric representations, but these conversions can also affect a data source in multiple character sets, audio/video data, and GIS data almost as often.
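The numeric case in the last bullet can be made concrete. The following is a minimal sketch in Python, not taken from any particular service; the amounts are made-up sample values. The same decimal text, converted through two equally plausible routes, produces totals that no longer agree.

```python
from decimal import Decimal

# Hypothetical amounts as they might arrive in a text or JSON payload.
raw_amounts = ["0.10", "0.20"]

# Conversion choice 1: binary floating point (inexact for most decimal fractions).
float_total = sum(float(a) for a in raw_amounts)

# Conversion choice 2: exact decimal arithmetic (preserves the source digits).
decimal_total = sum(Decimal(a) for a in raw_amounts)

print(float_total == 0.3)                 # False: the float sum is slightly off
print(decimal_total == Decimal("0.30"))   # True: the decimal sum is exact
```

Neither route is wrong in itself; the integrity problem appears when different parts of a composite application quietly pick different ones.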
Figure 2. Run-time checking of the interface contract on Web services
Large interface contract surface areas. Extremely complex schemas and interface structures are growing into one of the leading problems in distributed systems data architecture. Experience has shown time and again that simpler designs are more reliable and of higher quality. Yet the interface contracts of distributed services often have large and complex XML schemas and WSDL definitions. The likelihood that changes will occur, inadvertently or not, increases with the complexity of an interface contract and its embedded schemas. While simple XML and REST approaches to services tend to encourage the desired simplicity, some application domains are not "simple," and these service models are subject to additional problems because of the lack of interface contract standards. The bottom line is that the likelihood of the contract changing without a remote client being aware of it increases with the size of the overall contract, because changes to a large schema are more likely to go undetected than changes to a smaller one.
Implicit version changes. Even in well-controlled environments, changes to services and their interface contracts can happen without notifying all dependent parties. Barring changes to the way data services currently work, this pattern will only become more common in highly federated environments, both present and future. The services you depend on will change without your knowledge or prior notification, and you will have to accept this fact as practical, inevitable, and unfortunate. Consequently, developing conscious strategies to detect version changes and handle them effectively is essential to maintain the quality of a mashup or composite application's data and functionality.
The patterns described here are intended to present a lightweight stance on data modeling and application architecture. They come from observing the dynamic world of Web mashups and composite applications that have been thriving on the Web for several years now. The nimble, minimalist approaches used by mashups in particular are an inspiration, but in themselves they often are not sufficient to create quality software. These patterns are intended to capture the spirit of mashups, to put it in context with good software engineering practices, and to start a discussion and debate in the software community about these lean methods for connecting services and data.
Pattern 1: Minimal surface-area dependency on interface contract. This pattern is the front-end aspect of the well-known software design maxim: "Be liberal in what you accept, conservative in what you send." Most Web service toolkits, relational database frameworks, and multimedia libraries encourage dependencies on too much of the interface contract, often the entire contract, regardless of what the software actually depends on. For a great many consumption and composition scenarios, a small surface-area dependency is really all that is necessary. A full contract dependency introduces a tremendous amount of brittleness, because changes that have no relation to the data the client depends on will still break the client. While some data consumption toolkits are already forgiving of this dependency, there is often little control over what parts the toolkits care about.
For many applications, only direct dependencies of the needed internal data elements are appropriate. All other dependencies should be actively elided from the dependency relationship within the source data. The upshot here is that changes to the interface contract that are not of importance to the data consumer must not prevent the use of the data. The converse must be true as well; changes to the contract that matter must immediately become apparent to prevent incorrect behavior and use of the data. Examples of minimal dependencies include: XML, key paths through the schemas to data elements; relational data, only the tables, types, columns, and indexes used by the client; and multimedia, only the portions of the media structure that are required such as the specific channels, bit rates, image portion, video segment, and so on.
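For the XML case, the idea of depending only on key paths can be sketched as follows. This is an illustrative Python example using the standard library; the element names, the `REQUIRED_PATHS` table, and the `ContractError` class are all hypothetical. Unrelated additions to the document are ignored, while the loss of a path the client needs fails loudly.

```python
import xml.etree.ElementTree as ET

# Hypothetical: the only two paths this client depends on.
REQUIRED_PATHS = {"id": "./customer/id", "total": "./summary/total"}

class ContractError(Exception):
    """Raised when a path the client depends on is missing."""

def extract(xml_text, paths=REQUIRED_PATHS):
    """Return only the values the client needs; ignore everything else."""
    root = ET.fromstring(xml_text)
    values = {}
    for name, path in paths.items():
        node = root.find(path)
        if node is None or node.text is None:
            raise ContractError(f"missing required path: {path}")
        values[name] = node.text.strip()
    return values

# Version 2 adds an attribute and two elements the client never asked for.
v1 = ("<order><customer><id>42</id></customer>"
      "<summary><total>9.50</total></summary></order>")
v2 = ("<order version='2'><customer><id>42</id><tier>gold</tier></customer>"
      "<shipping/><summary><total>9.50</total></summary></order>")

print(extract(v1) == extract(v2))   # True: the irrelevant changes do not break the client
```

A document that drops `summary/total` would raise `ContractError`, which is the converse requirement: changes that matter become apparent immediately.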
Pattern 2: Run-time contract checking for changes. The services that mashups and composite applications depend on for data are subject to unexpected change at any time. This change could be because of moving the endpoint to a new location, stopping use of a previously supported protocol, changes to underlying schema, or even just outright deliberate interface changes or internal service failures. Whatever the reason, the contract must be checked to detect changes to it. You must also be able to handle many contract changes gracefully, since they may very well not affect the part of the contract you care about.
In reality, there are two types of run-time contract check. One is to check the contract specification itself, if one exists. This specification is often the WSDL or other metadata that is officially published in conjunction with the service itself. These can be checked easily and quickly with a variety of programmatic techniques, including XPath queries for XML-based schemas or other relatively lightweight techniques. Hard-coded contract checks in traditional programming languages are not as easy to implement or change and should be a second choice.
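A check of the published specification might look like the sketch below. It uses the limited XPath support in Python's `xml.etree.ElementTree`; the tiny WSDL document, the `GetQuote` operation, and the `symbol` element are invented for illustration. Because the checks are kept as data rather than hard-coded logic, they are easy to change.

```python
import xml.etree.ElementTree as ET

NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/",
      "xsd": "http://www.w3.org/2001/XMLSchema"}

# Hypothetical checks: each path must match at least one node in the contract.
CONTRACT_CHECKS = [
    ".//wsdl:portType/wsdl:operation[@name='GetQuote']",
    ".//xsd:element[@name='symbol']",
]

def failed_checks(wsdl_text, checks=CONTRACT_CHECKS):
    """Return the checks that no longer match the published contract."""
    root = ET.fromstring(wsdl_text)
    return [check for check in checks if root.find(check, NS) is None]

# A made-up, heavily trimmed WSDL document used only for this example.
WSDL_V1 = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <types>
    <xsd:schema>
      <xsd:element name="symbol" type="xsd:string"/>
    </xsd:schema>
  </types>
  <portType name="QuotePort">
    <operation name="GetQuote"/>
  </portType>
</definitions>"""

# Simulate the supplier renaming the operation without notice.
WSDL_V2 = WSDL_V1.replace('name="GetQuote"', 'name="GetQuoteV2"')

print(failed_checks(WSDL_V1))   # no failures for the original contract
print(failed_checks(WSDL_V2))   # reports the operation check
```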
The second contract check is to do validation of the contract against the instances of data that the service provides. Interface contracts often have their own abstraction impedance with their delivery mechanisms, and it can be surprising how often inconsistencies will creep in between the data provided and the interface contract, even with otherwise high-quality toolkits.
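Instance checking is especially useful for formats that arrive without a machine-readable schema. The sketch below assumes a JSON payload and a hypothetical `EXPECTED_FIELDS` table covering only the fields the client uses; it is one simple way to do this check, not a standard mechanism.

```python
import json

# Hypothetical expectations for the fields this client uses: name -> allowed types.
EXPECTED_FIELDS = {"id": int, "title": str, "price": (int, float)}

def instance_problems(payload_text, expected=EXPECTED_FIELDS):
    """Compare one data instance with the client's expectations."""
    record = json.loads(payload_text)
    problems = []
    for field, types in expected.items():
        if field not in record:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, types):
            problems.append(f"{field}: unexpected type {type(value).__name__}")
    return problems

good = '{"id": 7, "title": "Map", "price": 9.5, "extra": true}'
bad = '{"id": "7", "title": "Map"}'

print(instance_problems(good))   # []
print(instance_problems(bad))    # ['id: unexpected type str', 'price: missing']
```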
The biggest dilemma presented by this pattern is how often to check the contract itself. Checking the contract and the instance data with each retrieval of data from its source can be time-consuming and is usually unnecessary. Ultimately, the right frequency of contract checking depends largely on the application requirements, the nature of the data, and the application's tolerance for inaccuracy and incorrect behavior. For many applications, checking synchronously may not even be an option, and it may make sense to set up an appropriately periodic background process to identify contract changes, which will inevitably occur.
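One way to bound the cost of checking is to re-fetch the contract no more than once per interval and compare a fingerprint with a stored baseline. The `ContractWatcher` class below is a hypothetical sketch: `fetch_contract` stands for whatever call retrieves the contract text (or only the fragment the client depends on), and the clock is injectable so that the behavior can be demonstrated without waiting.

```python
import hashlib
import time

class ContractWatcher:
    """Detects contract changes, re-fetching at most once per interval."""

    def __init__(self, fetch_contract, min_interval_s, clock=time.monotonic):
        self._fetch = fetch_contract
        self._interval = min_interval_s
        self._clock = clock
        self._baseline = self._fingerprint()   # contract as first seen
        self._last_check = clock()
        self._changed = False

    def _fingerprint(self):
        return hashlib.sha256(self._fetch().encode("utf-8")).hexdigest()

    def has_changed(self):
        now = self._clock()
        if now - self._last_check >= self._interval:
            self._last_check = now
            self._changed = self._fingerprint() != self._baseline
        return self._changed   # cached answer between checks

# Demonstration with a fake service and a fake clock.
state = {"text": "contract v1", "now": 0.0, "fetches": 0}

def fake_fetch():
    state["fetches"] += 1
    return state["text"]

watcher = ContractWatcher(fake_fetch, 60, clock=lambda: state["now"])
state["text"] = "contract v2"   # supplier changes the contract silently
state["now"] = 10.0
early = watcher.has_changed()   # too soon: no fetch, change not yet seen
state["now"] = 61.0
later = watcher.has_changed()   # interval elapsed: re-fetched and detected
```

The same `has_changed` call could be driven from a background thread or scheduled job when synchronous checking is not an option.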
Pattern 3: Reducing structures to a common data-abstraction format. In reality, the carnival of data structures and formats that mashups and composite applications have to work with will only continue to grow. Software can either manipulate them in their native data formats, which means losing opportunities to maintain relationships and enforce business rules, or it can convert all of them to a common abstraction. This conversion is the approach that data libraries like ADO.NET take with DataSets and that XML takes when it is used to represent non-XML data sources. Subsuming the differences in source data by converting them into and out of a common data abstraction that provides a single unified model to work with can be appealing for a number of reasons. First, relationships between the various underlying data sources can be checked and enforced as the data is manipulated and changed. Second, views into the data can take advantage of the abstraction mechanism, which makes it less necessary to build custom Model-View-Controller (MVC) mechanisms and makes it possible to use existing libraries and frameworks that can process the abstraction.
Not all sets of heterogeneously formatted data structures are a good candidate for this pattern, and it may make little sense for some data formats that have dramatic levels of abstraction impedance—image data with non-image data, for example. There is also a potentially nontrivial cost for data transformation and conversion into and out of the common format. For many applications, however, this cost is entirely acceptable.
Within certain data-consumption environments, like the Web browser, there are limited choices for maintaining a common data format, and JSON and the XML DOM tend to be quite popular for browser-based solutions. On the server side the choices are far richer but are often platform dependent. XML structures, O/R libraries like Hibernate, relational databases, and even native object graphs often make excellent common abstraction models depending on the application. But the very different nature of hierarchical data like XML and object graphs is one of the classic impedance problems in computer science, and care must be taken when using this pattern.
The bottom line: if performance is not absolutely critical and the underlying data formats and schemas are amenable, this pattern can be very powerful for working with federated data sources. The disadvantage is that the common abstraction approach can certainly involve more maintenance and take on brittleness because mapping and metadata must be maintained.
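As a rough illustration of this pattern, the sketch below converts an XML customer list and a JSON order list into one hypothetical `Record` type that also carries the provenance of each item. Once both sources share an abstraction, a cross-source relationship can be checked in a few lines. The service names, element names, and field names are invented.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class Record:
    kind: str      # "customer" or "order"
    key: str       # identifier normalized to a string
    fields: dict
    source: str    # provenance: which service supplied the data

def from_customer_xml(xml_text, source):
    root = ET.fromstring(xml_text)
    return [Record("customer", c.get("id"), {"name": c.findtext("name")}, source)
            for c in root.findall("customer")]

def from_order_json(json_text, source):
    return [Record("order", str(o["id"]), {"customer": str(o["customer_id"])}, source)
            for o in json.loads(json_text)]

def dangling_orders(records):
    """Orders that refer to a customer no source has supplied."""
    customers = {r.key for r in records if r.kind == "customer"}
    return [r for r in records
            if r.kind == "order" and r.fields["customer"] not in customers]

customers_xml = "<customers><customer id='c1'><name>Ada</name></customer></customers>"
orders_json = '[{"id": 1, "customer_id": "c1"}, {"id": 2, "customer_id": "c9"}]'

records = (from_customer_xml(customers_xml, "crm-service")
           + from_order_json(orders_json, "orders-service"))
problems = dangling_orders(records)   # the order that points at unknown customer c9
```

The two `from_...` functions are exactly the mapping code whose maintenance cost is the disadvantage noted above.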
Pattern 4: Native structures mediated with Model-View-Controller (MVC). Converting all source data into a common format will not be an option in many situations because 1) the processing overhead is excessive since much of the data might not be used, or 2) duplicating it in the local environment might be prohibitive in terms of resources. Or it could be because there is no common format that makes sense for all the underlying data types. In this case, creating an MVC that mediates the access, translation, and data integrity with the underlying native structures can provide the best options for both performance and data storage.
MVC is a powerful design pattern in its own right that has been proven time and again as an excellent strategy for the separation of the concerns in application software. Its use is particularly appropriate when there are multiple underlying data models in a given application. Good software design dictates that offering a unified view of the source data provides a single, clean, consistent interface that makes it easy to view, interact with, and modify the underlying data. Access to the underlying data with the MVC approach is also relatively efficient since only the data needed must be processed to satisfy most requests.
While MVC has many advantages, its disadvantages are similar to those of pattern 3, in that the maintenance burden of the MVC code can be enormous. Certainly, there are an increasing number of off-the-shelf libraries that can help software designers build MVC on both the client and the server. Be warned, though: mapping code is brittle and tedious.
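A very small sketch of the mediation idea follows. The `CatalogModel` class keeps an XML tree and a JSON-derived dictionary in their native forms and touches only the data a request needs, while a view function and a controller function sit on top. All names and sample data are hypothetical, and a real application would need far more than this.

```python
import json
import xml.etree.ElementTree as ET

class CatalogModel:
    """Model: mediates two native structures without converting either one."""

    def __init__(self, products_xml, stock_json):
        self._products = ET.fromstring(products_xml)   # stays an XML tree
        self._stock = json.loads(stock_json)           # stays a plain dict

    def name(self, sku):
        # sku is assumed to be a trusted identifier without quote characters.
        node = self._products.find(f"./product[@sku='{sku}']/name")
        return None if node is None else node.text

    def quantity(self, sku):
        return self._stock.get(sku, 0)

    def set_quantity(self, sku, qty):
        # Integrity rules that span both underlying structures live here.
        if self.name(sku) is None:
            raise KeyError(f"unknown sku: {sku}")
        if qty < 0:
            raise ValueError("quantity cannot be negative")
        self._stock[sku] = qty

    def stock_json(self):
        return json.dumps(self._stock, sort_keys=True)

def render(model, sku):
    """View: one consistent presentation over both sources."""
    return f"{model.name(sku)}: {model.quantity(sku)} in stock"

def restock(model, sku, amount):
    """Controller: applies a change through the model, then re-renders."""
    model.set_quantity(sku, model.quantity(sku) + amount)
    return render(model, sku)

model = CatalogModel(
    "<products><product sku='A1'><name>Lamp</name></product></products>",
    '{"A1": 3}',
)
message = restock(model, "A1", 2)   # "Lamp: 5 in stock"
```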
Table 1. Common data formats, their contract or schema mechanisms, advantages, and disadvantages.
|Data format|Contract/schema|Advantages|Disadvantages|
|---|---|---|---|
|Text, JSON, binary|Informal, textual|Relatively efficient|Not self-describing, not ideally efficient|
|Native objects|Class definition|Unified behavior and data, encapsulation, high-level abstraction, and composition|Requires conversion of most data into objects, requiring a mapping technique|
|XML|XML schemas (XSD), Relax NG, WSDL, and many others|Self-describing, rich schema description, and extensible without breaking backward compatibility|Very size inefficient, no way to distribute behavior, and schema descriptions are limited even with XSD|
|Images|Specifications for JPEG, TIFF, GIF, BMP, PNG, and many others|N/A|N/A|
|Audio|Specifications for WAV, WMA, MP3, and AAC|N/A|N/A|
|Video|Specifications for AVI, QuickTime, MPEG, and WMV|N/A|N/A|
Pattern 5: Direct native data structure access. For many applications, especially simpler, browser-based ones, converting source data into common formats or building sophisticated MVC architectures just isn't a good option. Direct access to the data makes more sense, and the decision is often a commonsense one, having to do with the libraries at hand. Plus, as mentioned before, simpler is often higher quality and better because there is less to break or maintain.
In this pattern, which works best with less highly structured data, the native data structures are used directly without an intermediary or any data conversion, which means data stores in text, XML, JSON, and so on are manipulated natively. The drawback, of course, is that you cannot rely on the facilities that data abstraction libraries can give you to enforce integrity or track changes. However, this pattern often requires the least processing and the least time spent learning third-party libraries. It can be easy to develop, and because there are no conversion or data access layers to go through, it's also quite fast.
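The trade-off is easy to see in a few lines. In this sketch (the payload is invented sample data), a JSON document is parsed, edited in place, and serialized again with nothing in between, and nothing objects when an incomplete entry is added.

```python
import json

# Hypothetical payload from a places service.
payload = '{"title": "Cafes", "places": [{"name": "Blue Door", "rating": 4}]}'

doc = json.loads(payload)                   # native dicts and lists, no mapping layer
doc["places"][0]["rating"] = 5              # edit in place
doc["places"].append({"name": "Corner"})    # no "rating": nothing enforces integrity
updated = json.dumps(doc)                   # ready to send back to the source
```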
Pattern 6: All data modification is atomic. More sophisticated views of data and services prescribe properties known as ACID, for atomicity, consistency, isolation, and durability. Usually ascribed to database systems, ACID is an excellent and practical rule of thumb for almost any concurrent data access system. Unfortunately, almost no one yet in the world of Web services and mashups has the notion of transactions in their protocols, which would confer many of the properties of ACID on the modification of data. Far from ignoring the problem, mashup and composite application developers must be acutely aware that they are working without a net as they work with their data.
This awareness introduces significant issues with complex data modification scenarios, one being that any data retrieval and storage that depends on an extended conversation with underlying services will almost certainly not be protected by a single transaction boundary and its related ACID properties. Software code must expect that data integrity problems will occur, especially in an extended conversation that may fail partway through or never finish. Avoiding long-running conversations altogether is one option. Making software operation depend, as much as possible, on individual atomic interactions with underlying services is another good approach. Each completed step in the interaction is a discrete, visible success, which allows the software to offer its users clear options when a conversation with a supplier data service fails. These two options allow software developers to respect the ACID rule of thumb.
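The second option can be sketched as a runner that treats every step as one atomic call and records exactly how far the conversation got. The `Outcome` type, the `run_steps` function, and the three sample steps are hypothetical; the sketch provides no rollback, only visibility.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Outcome:
    completed: list = field(default_factory=list)   # steps known to have succeeded
    failed: Optional[str] = None                    # step that failed, if any
    error: Optional[str] = None

def run_steps(steps):
    """Run (name, callable) pairs in order; each callable is one atomic service call."""
    outcome = Outcome()
    for name, call in steps:
        try:
            call()
        except Exception as exc:
            outcome.failed = name
            outcome.error = str(exc)
            break                       # stop: later steps are not attempted
        outcome.completed.append(name)
    return outcome

# Demonstration with stand-ins for three independent service calls.
saved = []

def save_profile():
    saved.append("profile")

def save_photo():
    raise TimeoutError("photo service timed out")

def save_tags():
    saved.append("tags")

result = run_steps([("profile", save_profile),
                    ("photo", save_photo),
                    ("tags", save_tags)])
# result.completed == ["profile"], result.failed == "photo"
```

With this record in hand, the application can tell the user that the profile was saved and offer to retry the photo, instead of leaving the data in an unknown state.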
The Web itself is becoming the most important supplier of highly federated data, a source that will only grow and become more important in the next few years. While XML and lightweight data formats will likely be the dominant data structures that most software will have to work with in the foreseeable future, curve balls are en route. These curve balls include much needed optimizations in XML such as binary XML, advances in microformats, new multimedia codecs that will suddenly change an entire audio/visual landscape, and completely new transport protocols such as BitTorrent, which will make many of the patterns here problematic, to say the least.
Simplicity in data design is back in vogue, exemplified by the rising interest in formats like JSON, microformats, and dynamic languages like PHP and Ruby. This design simplicity can make data more malleable and easier to connect together. It also makes it less difficult to write software that cares for and manages the data. While the patterns presented here are overarching ones that can lead to less brittle, more loosely coupled, and high-integrity data consumption and composition, the story is really just beginning. As the Web becomes less about visual Web pages and more about services and pure data and content, becoming a nimble consumer and supplier in the information ecosystem will be an increasingly critical success factor.
Dion Hinchcliffe is chief technology officer at Sphere of Influence Inc.
This article was published in the Architecture Journal, a print and online publication produced by Microsoft. For more articles from this publication, please visit the Architecture Journal website.