Data on the Outside vs. Data on the Inside
By Pat Helland
Summary: Pat Helland explores Service Oriented Architecture, and the differences between data inside and data outside the service boundary. Additionally, he examines the strengths and weaknesses of objects, SQL, and XML as different representations of data, and compares and contrasts these models. (26 printed pages)
The Shift Towards Services
Assumptions About Service Oriented Architecture
Outside Data: Sending Messages
Outside Data: Reference Data
Data on the Inside
Data: Then and Now
Representations of Data: Inside and Outside
Up until now, most of the discussions on Service Oriented Architecture (SOA) have revolved around the integration of disparate systems, leveraging companies' existing assets, or creating a robust architecture. All of these issues are relevant to SOA. Yet, there are other significant and engaging issues involving SOA that are worth close attention. In its goal to connect heterogeneous and autonomous systems, SOA adheres to several core design principles. One of these principles maintains that independent services consist of pieces of code and data interconnected through messaging.
Indeed, services are inextricably tied to messaging in that the only way into and out of a service is through messages. However, services still operate independently of each other. Because of the unique relationship between services and messages, architects, developers, and programmers alike began asking critical questions: How does data flow between services? How are messages defined? What data is shared? How is data inside a service different from data outside a service? How is data represented inside and outside services?
Findings to these questions exposed seminal differences between data on the inside of a service and data that existed outside of the service boundary. Data outside a service is sent between services as messages and must be defined in a way understandable to both the sending service and the receiving service. Data inside a service is deeply rooted in its environment. Unlike data outside services, data on the inside is private to the service. In fact, it is only loosely correlated to the data on the outside.
In response to the above findings, this paper leads readers into an in-depth discussion on data inside services and data outside services. Readers are introduced to different kinds of data outside services including immutable, versioned, and reference data. The discussion then turns to data inside services involving messages (operators and responses), reference data, and service-private data. Next, the temporal interpretation of data inside services and outside services is explored. Once the different kinds of data are identified, attention is given to the representation of data through an examination of three critical models: XML, SQL, and objects.
Although SOA promises to continue stimulating conversation across enterprises and in the IT industry, the buzz accompanying it may now be about data inside services and data outside services. There is now a strong momentum for enterprises to not only bring SOA into their environments, but also to achieve a deeper understanding of their services and the behavior between services and data.
One core concern of SOA is that independent services consist of pieces of code and data, with messages interconnecting the services. Each service is a unique collection of code and data that stands alone and is independent of other services. However, each service is also interconnected with other services through messaging. This interconnection differentiates services from the silos existing in many environments.
Messaging carries enormous importance in SOA. Messages are sent between services. The schema definition for each message and the contract defining the flow of the messages specify the "black box" behavior of the service. Services are inextricably tied to messaging in that the only way into and out of a service is through messages. A partner service is only aware of the sequencing of the messages flowing back and forth.
Figure 1. Services and messages are tied together
Sometimes many related messages are sent between two different services. Related messages can flow between two services over the course of weeks or months. For example, an individual may reserve a train ticket on June 1, 2004 and then change the ticket to a different date on June 5, 2004. The individual may then confirm and pay for the reservation on June 10, 2004 and finally cancel the ticket on June 25, 2004. In this example, the individual sent messages every few days for a number of weeks. This is referred to as long-running work. Messages in long-running work are related, with the second message dependent on the first message and the third message dependent on the first two messages. A cookie or something similar is used to correlate the messages in a piece of long-running work. We avoid the phrase long-running transaction to eliminate confusion with atomic database transactions. In addition, the word transaction suggests an activity with a beginning and an end. Most long-running work impacts other applications in ways that ripple through multiple businesses without a clear boundary of where one piece of work ended and another started.
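As a minimal sketch of how such correlation might work, the Python below groups incoming messages by a shared identifier. The correlation_id field plays the role of the cookie; the class names, field names, and ticket identifiers are hypothetical and are not taken from any particular messaging framework.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    correlation_id: str  # hypothetical "cookie" shared by all related messages
    sequence: int        # position of this message within the long-running work
    operator: str        # business request carried by the message
    sent_on: str         # ISO date on which the message was sent


class Conversations:
    """Groups incoming messages by their correlation identifier."""

    def __init__(self) -> None:
        self._by_id = defaultdict(list)

    def receive(self, message: Message) -> None:
        self._by_id[message.correlation_id].append(message)

    def history(self, correlation_id: str) -> list[str]:
        # Return the operators of one piece of long-running work, in order.
        related = sorted(self._by_id.get(correlation_id, []), key=lambda m: m.sequence)
        return [m.operator for m in related]


conversations = Conversations()
for msg in [
    Message("ticket-42", 1, "RESERVE", "2004-06-01"),
    Message("ticket-42", 2, "CHANGE-DATE", "2004-06-05"),
    Message("ticket-77", 1, "RESERVE", "2004-06-07"),  # unrelated work, interleaved
    Message("ticket-42", 3, "CONFIRM-AND-PAY", "2004-06-10"),
    Message("ticket-42", 4, "CANCEL", "2004-06-25"),
]:
    conversations.receive(msg)

ticket_history = conversations.history("ticket-42")
```

The receiving service never needs to know how its partner stores the reservation; it only needs the identifier to tie each new message to the earlier ones.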
Figure 2. Multiple related messages
To summarize, messages are a critical part of SOA. As the paper will show, messages must receive special care to ensure correct interpretation and to avoid confusion as they flow between different services.
Services vs. Components
There has been a natural evolution over the years involving functions, objects and components, and services. In the beginning, code was separated into functions that allowed software to be grouped into smaller and better-organized pieces. Later, components and objects evolved allowing for the encapsulation of data (member variables) within them.
Currently, services have taken center stage in the evolution process. Services provide a coarser grained form of separation with more independence between the pieces of code than functions and components. First, services always interact with each other through messages. Second, they are normally durable allowing them to survive system failures and restarts. Finally, services have opaque implementation environments where only the messaging interactions are visible from the outside.
Figure 3. Services versus components
Considering Inside and Outside Data
It is important to make a distinction between the data inside services and the data outside services. Data outside a service is sent between services as messages and must be defined in a way understandable to both the sending service and the receiving service. Data inside a service is deeply rooted in its environment.
The need to interpret the data in at least two different services makes the existence and availability of a common schema imperative. The schema should also have certain characteristics. First, independent schema definition is important. This means the sender or receiver should be able to define message schemas without having to consult each other. Second, the message schema should be extensible. Extensibility allows the sending service to add additional information to the message beyond what is specified in the schema.
Note The sender of the message may or may not be the definer of the schema for the message.
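One way to picture extensibility is a receiver that reads the fields the shared schema defines and sets aside anything extra the sender added. The sketch below assumes JSON as the wire format and uses hypothetical field names; the paper itself does not prescribe either.

```python
from __future__ import annotations

import json

# Fields this hypothetical receiver knows about from the shared schema.
KNOWN_FIELDS = {"order_id", "customer_id", "sku"}


def read_order(raw: str) -> tuple[dict, dict]:
    """Split a message into schema-defined fields and sender extensions."""
    payload = json.loads(raw)
    missing = KNOWN_FIELDS - set(payload)
    if missing:
        raise ValueError(f"message lacks required fields: {sorted(missing)}")
    known = {k: v for k, v in payload.items() if k in KNOWN_FIELDS}
    # Extensions are preserved rather than rejected, so the sender can add
    # information beyond what the schema specifies.
    extensions = {k: v for k, v in payload.items() if k not in KNOWN_FIELDS}
    return known, extensions


raw_message = json.dumps(
    {"order_id": "O-1", "customer_id": "C-9", "sku": "SKU-123", "gift_wrap": True}
)
known, extensions = read_order(raw_message)
```

A receiver written this way keeps working when the sender starts including a field it has never seen.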
Unlike data outside services, data on the inside is private to the service. In fact, it is only loosely correlated to the data on the outside. Data on the inside is always encapsulated by service code so the only way of accessing it is through the business logic of the service.
Figure 4. Data inside and data outside services
Mainframes and monoliths—All About Data
In the past, it was typical for a mainframe or other server system to support multiple applications. Applications accessed a shared database and worked on either a shared set of tables or different tables within the same database. Since all the tables were in a shared database on a large server, applications could perform a single transaction that accessed data from many different tables. Likewise, operations updating multiple tables could take advantage of transactional atomicity in their changes. Equally important, not only did the applications have access to all the data in the database, but they could also access tables managed by different applications. This has colored how people think about the relationship across applications, since applications had immediate access to the latest and, usually, most accurate information. While this type of access is contingent on such measures as security, authorization, and privacy concerns, most applications assume they can simply look in order to see the latest information.
In recent years, various economic and technological trends have resulted in applications steadily moving off to different machines. This has resulted in the fracturing of the monolith. A single giant system no longer runs all the applications in a typical organization. This, however, raises an issue because as applications move to different machines, it becomes more difficult to share the same data since data now resides on different machines. Updating across these machines also becomes difficult since the machines are designed to be independent and, typically, do not trust each other.
Figure 5. Mainframes and monoliths
Major Tenets of Service Oriented Architecture
Up to this point, there has been discussion on code and data, systems, and messages. Like any other subject matter that is under discussion or deserves writing down and sharing with others, there exist beliefs about the subject. Below is an outline of the four major tenets of Service Oriented Architecture that detail the existence of code, message format and content, the function of a service, and service compatibility.
- Boundaries are explicit. This means there is no ambiguity about where each part of the code exists. Specifically, it is clear if the code resides inside or outside of the service. The same belief exists for data. It is known if a database table lives inside or outside the service.
- Services are autonomous. This means each service has its own implementation, deployment, and operational environment. Therefore, a service can be rewritten without its partners being negatively impacted just as long as the correct message continues to be sent at the correct time.
- Services share schema and contract, but do not share implementation. Schema describes the format and the content of the messages while contract describes message sequences allowed in and out of the service. What is not known is how implementation takes place. Consider the use of an Automated Teller Machine (ATM). Most people know how to interact with these machines. They know what buttons to press and they know the outcome. For example, John enters a PIN and then presses some buttons. Most people suspect a computer is involved in the entire process. However, they are typically unaware of how it is all implemented.
- Service compatibility is based on a policy. Formal criteria exist for getting "service from the service." The criteria are located in an English document that outlines the rules for using the service. Currently, the WS-Policy effort is working on formalizing a declarative and programmatic way to express the policy requirements.
These are the basic principles of SOA and form the basis for the relationship between services.
Challenges with SOA
No matter what technology is brought on board to deal with the plethora of complex IT systems that make up today's enterprises, its beliefs and capabilities will be continuously challenged. In this section, attention is given to some of the existing challenges experienced by Service Oriented Architecture.
To date, two of the largest challenges experienced by SOA deal with explicit boundaries and autonomy. Explicit boundaries crisply define what is inside a service and what is outside a service. A service is comprised of code and data. Unlike functions and components, the code and data of different services are disjoint and data from one service is kept private from the data of another service. The disjoint collections of code and data reside within explicit boundaries called services.
Autonomy speaks to the independence of services from each other. For example, Service-A is always independent of Service-B. As long as the schema and contract are maintained, no adverse impact is expected. As a result, each service is free to be recoded, redeployed, or completely replaced independent of the other service.
Even with autonomy and explicit boundaries, there are often other complications such as trust issues across boundaries. There needs to be trust between services at all times. To achieve this, a service must first decide on what kind of trust it wants and second, define its own style of trust. It is common for the rules that define trust to be modeled after real interactions across businesses. After all, it is issues such as credit card fraud that made everyone think about trust in the first place.
The Debate About Transactions
Along with the beliefs and challenges that follow a technology, there are also debates. No matter where you turn there always seems to be a debate flourishing around some technology. Where SOA is concerned, one important debate is about transactions. On one side of the debate, people propose that atomic (ACID) transactions, perhaps implemented with 2-phase commit, be used across service boundaries.
Note WS-Transaction is currently involved with defining transactions that span service boundaries.
On the other side of the debate, people believe a service should never hold locks for other services because this involves a great deal of trust that the transaction's completion and, hence, record unlock will occur within a reasonable amount of time.
Upon closer analysis, this debate is really about the definition of the word service. If two pieces of code share atomic transactions, are they independent services or is their relationship so intimate that they comprise one service? There will always be bodies of code connected by 2-phase commit; however, the question is about whether or not they are in the same service.
This paper explicitly focuses on some of the challenges that arise when two services do not share atomic transactions. Just as there are pieces of code that share atomic transactions through 2-phase commit, there are others that do not. This paper will examine some of the implications that arise when the independent pieces of code do not share transactions. Hence, for this paper, the term service carries the connotation of independence, autonomy, and separate transactional scopes.
Operators and Operands
In a service oriented architecture, the interactions across the services represent business functions. These are referred to as operators. The following are some examples of operators:
- Please PLACE-ORDER.
- Please UPDATE-CUSTOMER-ADDRESS.
- Please TRANSFER-PAYMENT.
Operators are part of the contract between two services and are always about the business semantics of the stated service. Operators can also be a form of acknowledgement indicating the acceptance of an operator. Consider the following examples:
- Acknowledge receipt of PLACE-ORDER.
- Acknowledge completion of TRANSFER-PAYMENT.
An acknowledgment has business-defined depths. As a result, there is a difference between acknowledging the request receipt to TRANSFER-PAYMENT and acknowledging the completion of the transfer. This all comes down to clearly defining the business semantics in the contract.
Operator messages may also contain many operands. An operand is defined as a piece of information needed to conduct an operation. It must be placed in the message by the sending service. In simplest form, operands are responsible for the parameters in messages. Some examples of operands include the identification of a customer placing an order or the SKU number for a line item within the order.
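To make the distinction concrete, an operator message can be pictured as an operator name plus a set of named operands. The following is only an illustrative sketch; the OperatorMessage class and its field names are assumptions, not a format defined by this paper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class OperatorMessage:
    operator: str                           # the business function being requested
    operands: tuple[tuple[str, str], ...]   # name/value pairs placed by the sender

    def operand(self, name: str) -> str:
        # Look up one operand by name.
        return dict(self.operands)[name]


# The sending service fills in the operands before sending the operator.
place_order = OperatorMessage(
    operator="PLACE-ORDER",
    operands=(("customer_id", "C-1001"), ("sku", "SKU-5550")),
)
```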
Figure 6. Operator messages with operands
Let's consider how and where a service gets the operands it uses to prepare an operator message. Operands come from reference data, which is typically published by the target service for the operator.
Reference data is somewhat of a new kind of data in SOA. The word somewhat is used since versions of SOA have been in existence for decades with MQ, EDI, and other messaging systems. Similarly, before it was all computerized, humans were manually completing SOA-style operations. When customers wanted to order products from a department store, they filled out an order form and sent it in by mail. The department store catalog is reference data and the department store order is an SOA operation.
Figure 7. Publication of reference data and use of operands
Immutable and/or Versioned Data
Data exists in many forms. One type of data is immutable data. Essentially, immutable data is unchangeable once it is written. You can find immutable data almost anywhere in the real world. The following are a few examples:
- The New York Times edition of June 3rd, 1975 is unchangeable
- The first edition of a published book is unchangeable
- The words spoken by the United States president on television are unchangeable
All immutable data have identifiers. An identifier ensures the same data is yielded each time it is requested no matter when it is requested or where it is requested. Therefore, if the same identifier is used then the same bit pattern is retrieved. For example, a person will get the same pricelist today if it is retrieved from Joe's Internet Bazaar for Thursday, July 3, 2003.
Another type of data is versioned data. Versioning is a technique for creating immutable sets with unique identifiers. Through the availability of versioning, a person can ask about the latest service pack (SP) for Windows NT4, or a recent edition of the New York Times. Versioning also has different types of identifiers: version-dependent identifiers and version-independent identifiers.
Version-dependent identifiers include the desired version as part of the identifier. This identifier always retrieves the same immutable bit pattern. In contrast, version-independent identifiers do not include the desired version in the identifier. As a result, the process for resolving a version-independent identifier to its underlying data involves two steps:
- Map from the version-independent identifier to the version-dependent identifier
- Retrieve the immutable bit pattern of the version
The following is an example of the above steps from a real-world perspective. A person visits the newsstand to buy a recent edition of the New York Times. This request is version-independent, so it first involves deciding whether today's paper or yesterday's paper is needed. Having settled on today's paper, the person buys the edition dated July 20, 2004; that dated edition is what the version-dependent identifier names. Ultimately, with version-independent identifiers the answer given will not be the same for each request. For example, if the exact same event happens the following day, the person will receive a newspaper dated July 21, 2004 and not July 20, 2004.
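The two resolution steps can be sketched in a few lines of Python. The EDITIONS store and the function names are hypothetical, and dates are written as ISO strings so that plain string comparison orders them correctly.

```python
from __future__ import annotations

# Version-dependent identifier -> immutable content (hypothetical store).
EDITIONS = {
    ("nyt", "2004-07-20"): "contents of the July 20, 2004 edition",
    ("nyt", "2004-07-21"): "contents of the July 21, 2004 edition",
}


def resolve_latest(publication: str, as_of: str) -> tuple[str, str]:
    """Step 1: map a version-independent identifier to a version-dependent one."""
    dates = [d for (pub, d) in EDITIONS if pub == publication and d <= as_of]
    if not dates:
        raise KeyError(f"no edition of {publication} on or before {as_of}")
    return publication, max(dates)


def fetch(version_dependent_id: tuple[str, str]) -> str:
    """Step 2: retrieve the immutable bit pattern of that version."""
    return EDITIONS[version_dependent_id]


# The same version-independent request gives different answers on different days.
bought_first_day = fetch(resolve_latest("nyt", "2004-07-20"))
bought_next_day = fetch(resolve_latest("nyt", "2004-07-21"))
```

Only the first step depends on when the question is asked; the second step always returns the same content for the same identifier.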
Immutability of Messages
Every message traveling through a network may be retransmitted in the event the message is lost. Every message sent is guaranteed to be delivered zero or more times. Considerations are based on:
- Networks losing messages.
- Networks retrying messages.
- Those pesky retries actually being delivered.
It is important for retransmitted messages to remain unaltered no matter how many times they are sent or else a great deal of confusion and unhappiness would ensue. Therefore, all messages should be immutable.
Additional consideration needs to be given to messages traveling through the network. In the absence of a reliable messaging framework using serial number and retries, the end application may see duplicate messages. Additionally, careful thought must be given to the life of the messaging framework and the life of the endpoint. Consider a case when long-running work may take weeks or months to complete. TCP/IP cannot be counted on to eliminate duplicates in this situation. If the system crashes and is restarted, a different TCP connection is obtained and may result in duplicate messages being sent. Because confusion may arise from messages being resent, it is advantageous to have immutable messages so the same bits are always returned.
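A receiving application can protect itself from duplicates by remembering which message identifiers it has already processed. The sketch below is one possible approach, with hypothetical names; in a real service the record of seen identifiers would need to be durable so that it survives the crashes and restarts discussed above.

```python
class IdempotentReceiver:
    """Applies each message once, however many times it is delivered."""

    def __init__(self):
        self._seen = {}    # message_id -> body already processed
        self.applied = []  # bodies handed to the business logic, in order

    def deliver(self, message_id, body):
        """Return True if the message was applied, False if it was a duplicate."""
        if message_id in self._seen:
            # A retransmission must carry the same bits as the original.
            if self._seen[message_id] != body:
                raise ValueError("retransmission altered an immutable message")
            return False
        self._seen[message_id] = body
        self.applied.append(body)
        return True


receiver = IdempotentReceiver()
first = receiver.deliver("msg-1", "PLACE-ORDER SKU-5550")
retry = receiver.deliver("msg-1", "PLACE-ORDER SKU-5550")  # network retry arrives
```

Because the message is immutable, the receiver can compare a retry against what it already holds and discard it safely.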
Figure 8. Immutability of messages
To Cache or Not To Cache
Most conversations on caching usually end with a warning against the practice. This is not one of those cases. Caching immutable data is acceptable and, indeed, recommended because each time the data is requested the same answer is guaranteed. There is no possibility of error in this situation. As a result, the cache never has to be shot down. Caching data that is not immutable is risky because it can lead to anomalies.
It is also acceptable to cache versioned data since each version is immutable. There is never confusion over what data is returned from a cache involving a version-dependent identifier. The version-dependent identifier yields the same bits each time.
Note It is not recommended that anything be cached if it is referred to by a version-independent identifier. The results yielded in this situation are unpredictable.
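The caching rule can be illustrated with a small cache that stores only results requested by version-dependent identifier and always goes back to the source for version-independent requests. All names here (VersionedCache, origin, STORE, LATEST) are hypothetical and exist only for this sketch.

```python
# Hypothetical origin data: immutable versions plus a pointer to the latest one.
STORE = {
    ("pricelist", "2003-07-03"): "prices for July 3",
    ("pricelist", "2003-07-04"): "prices for July 4",
}
LATEST = {"pricelist": "2003-07-03"}


def origin(name, version=None):
    """Fetch from the publishing service; None means 'whatever is latest'."""
    if version is None:
        version = LATEST[name]
    return STORE[(name, version)]


class VersionedCache:
    def __init__(self, fetch):
        self._fetch = fetch
        self._entries = {}
        self.fetch_count = 0  # number of trips to the origin

    def get(self, name, version=None):
        if version is None:
            # Version-independent identifier: never cached.
            self.fetch_count += 1
            return self._fetch(name, None)
        key = (name, version)
        if key not in self._entries:
            self.fetch_count += 1
            self._entries[key] = self._fetch(name, version)
        return self._entries[key]


cache = VersionedCache(origin)
a = cache.get("pricelist", "2003-07-03")
b = cache.get("pricelist", "2003-07-03")  # served from the cache
c = cache.get("pricelist")                # latest, fetched from the origin
LATEST["pricelist"] = "2003-07-04"        # the publisher releases a new version
d = cache.get("pricelist")                # latest again, fetched again
```

Entries keyed by version never need to be invalidated, which is why the versioned path can cache freely.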
Normalization and Immutable Data
Normalization is an essential design consideration for database schemas. Because normalization ensures each data item lives in one place, it is easier to ensure updates do not cause anomalies. This is illustrated in a classic example involving an employee-manager database. The manager's phone number is commonly listed in each row for each employee in the database. A problem is encountered when trying to update the manager's phone number because it lives in numerous places.
Figure 9. Problems with de-normalization
If the data is immutable, it may be practical to allow it to be de-normalized. For instance, it is acceptable for email messages to be de-normalized since the messages cannot change once they are sent out of a service. Likewise, if a message is sent between services and will be interpreted by the business logic of the services, it may be challenging to perform joins. Because of this, immutable messages frequently contain de-normalized information.
Messages can be sent and resent, but at the end of the day if both the sender and the receiver do not understand the messages then nothing has been accomplished. As a result, every message needs a common schema. The schema used is typically referred to as meta-data. In the event the meta-data is ever changed, confusion will result. Accordingly, the schema used must always be known or should be able to be located if the message is going to be processed.
Stability of Data
Ensuring the immutability of data across service boundaries eliminates many problems, but it does not ensure the message is understood. For example, a reference to President Bush made in 2003 means something different than a reference to President Bush made in 1990. Readers may fail to notice that the two references are to different people and therefore misunderstand the message.
The notion of stable data is introduced as having an unambiguous interpretation across space and time. This leads to the creation of values that do not change. For example, most enterprises never recycle customer identifications or SKU numbers. It is problematic to ensure the old interpretation no longer exists so these values are permanently assigned. Consider a banking situation. If a single piece of customer information comes out of the bank's archive years later and the bank tries to look up the customer's identification, it had better not refer to some other customer. By not reusing a customer's ID, it remains stable.
Note It is also worth mentioning that anything that is current is also not stable. A reference to the current activity on a credit card is not stable because it does not clearly communicate when the snapshot of the activity was accurate.
In summary, data that is distributed must be both immutable and stable. Versioning is an excellent technique to create immutable data. Finally, the data must refer to an immutable schema. The combination of these results in an interpretation of the message that is invariant across space and time.
A Few Thoughts on Stable Data
Everything sent outside must be stable data so the interpretation of each message continues to be consistent across valid places and times to ensure the information is understood. Even data inside is sometimes stable. Cases like these occur when the data sent outside is also kept on the inside. Take for example a shopping basket and the product SKUs inside the basket. SKUs sit in a shopping basket service until they are sent to the order fulfillment service. When the SKUs are in the shopping basket service, they must be stable because they are living across multiple interactions with the inside.
Validity of Data in Bounded Space and Time
By bounding data in space and time, it is known where and when the data is valid. Placing an expiration date on data such as, "Offer is good until next Tuesday" is one example of the validity of the data in bounded time. There may also be information on where data is valid. Examples follow:
- The offer is only good to Washington State residents
- Data is valid only on these two servers
- The information is valid only for the Acme & Sons Company Accounting application
It is imperative that all valid data also be immutable and stable. Moreover, if valid data is retrieved then the same data should be produced and its interpretation should be unchanged.
Before deciding data is valid everywhere and at all times, consider if ranges in validity are beneficial. If they are then document the constraints in the message. Ultimately, it is wise to define the validity of any data sent outside of a service.
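Documenting the constraints in the message might look like the following sketch, in which an offer carries its own expiration date and the set of regions where it applies. The Offer class, its fields, and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Offer:
    text: str
    valid_until: date         # bound in time: last day the offer is good
    valid_regions: frozenset  # bound in space: where the offer applies

    def is_valid(self, on: date, region: str) -> bool:
        return on <= self.valid_until and region in self.valid_regions


offer = Offer(
    text="10% off all train tickets",
    valid_until=date(2004, 7, 27),
    valid_regions=frozenset({"WA"}),
)

in_range = offer.is_valid(date(2004, 7, 26), "WA")
expired = offer.is_valid(date(2004, 7, 28), "WA")
wrong_place = offer.is_valid(date(2004, 7, 26), "OR")
```

The receiver can evaluate the bounds from the message alone, without asking the sender whether the data still applies.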
Rules for Sending Data in Messages
The following table offers some distilled rules for sending messages outside of service boundaries:
Table 1. Rules for sending data
|Identify the messages||
|Ok to Cache||
|Define Valid Ranges||
|Must be Stable||
What Is Reference Data?
Reference data is information published across service boundaries. For each collection of reference data there is one service that creates it, publishes it, and periodically sends it to other services. There are three broad uses for reference data: operands, historic artifacts, and shared collections of data. Sometimes, the distinction between their uses is not rigid and may overlap.
Operands and Operators
Operands add crucial information such as parameters or options to create the operator requests sent out for processing. The following are examples of operands:
- The customer-ID for the customer placing the order
- The part numbers for the parts being ordered
- The expected shipment date and the price agreed to for the order
The data for operands is published as reference data. Reference data is typically sent out on different schedules as required. Consider the following:
- The customer database is sent daily as a snapshot
- The parts catalog is updated weekly
- The price-list is updated daily
Versioned reference data is published by the authoritative service so its partners have the timely operands needed to ensure information accuracy. It is essential that the operator requests are processed understanding that the operands are derived from versioned reference data. This is just like specifying that an order to a department store catalog is based on the Fall and Winter edition of the catalog.
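A small sketch shows why the operator request should name the version of the reference data its operands came from. The catalog names, SKUs, and prices (held in cents) are invented for this example.

```python
# Hypothetical versions of a published catalog: SKU -> price in cents.
CATALOGS = {
    "2004-fall-winter": {"SKU-1": 1999, "SKU-2": 500},
    "2005-spring-summer": {"SKU-1": 2499, "SKU-2": 500},
}


def price_order(order):
    """Price an order against the catalog version the order itself cites."""
    catalog = CATALOGS[order["catalog_version"]]
    return sum(catalog[sku] * quantity for sku, quantity in order["lines"])


order = {
    "operator": "PLACE-ORDER",
    "catalog_version": "2004-fall-winter",  # edition the operands were taken from
    "lines": [("SKU-1", 2), ("SKU-2", 1)],
}

total_cents = price_order(order)
```

Even after a newer catalog is published, the order is interpreted against the edition the customer actually read.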
Historic artifacts are another type of reference data. Their purpose is to document past information within the transmitting service. Related services receive and use historic artifacts to perform other business operations. Examples of historic data include:
- Quarterly sales results
- Monthly bank statements
- Monthly bills
The reference to monthly bills needs further discussion. Not only do monthly bills include historic artifacts about activity during the past month, but monthly bills also request customers make a payment. This request defines a business function. Therefore, monthly bills also behave as operators.
The use of historic artifacts raises many privacy concerns. However, in almost all cases, historic artifacts are only shared between closely related services that are trusted, or the receiving partner is sent specific pieces of the service's data appropriate for the partner's viewing. An example of tightly related and trusted services involves quarterly sales results. These results are published by the sales supporting services and sent to the accounting department's services. Inventory status rollup is then passed to accounting. Alternatively, a bank statement or a phone bill sent to a customer illustrates historic artifacts that are transmitted to a single partner.
Figure 10. Publication of Historic Artifacts
Reference data sometimes shares the same collection of data across an enterprise or different enterprises. Even after this collection of data is accessed across an enterprise, it continues to evolve and change. Typically, there is one special service that owns this information and is responsible for updating and distributing new versions of the data across systems. Examples of shared collections of data include:
- Customer database - It contains all the relevant information about the customers of the enterprise.
- Employee database - It contains information about every employee in the enterprise.
- Parts database and pricelist - It contains descriptions of the parts, SKUs, and their characteristics. Also included are the prices for the various SKUs and the discount policies for customers.
It is worth noting that the version distributed across an enterprise is never the latest version of the data. It is generally impossible to have universal knowledge of the latest information in a loosely coupled distributed system. As a result, those interested in the data must settle for a recent version because only the authoritative service knows the latest state of the information. Even if the authoritative service tries to meet a request for a more recent view of the information, the data can change by the time the partner system receives it. By the time the partner service sees it, the authoritative service cannot guarantee that it is the most recent copy.
Although shared collections of data have proven their place in the world, they represent a huge challenge for most enterprises. Many different applications have their own opinions of the correct value of the data for a customer. For instance, differences about what constitutes a customer exist. There is also a difference of opinion on the data needed to describe a customer. Lastly, the representations of certain data items describing a customer are not semantically aligned across different applications. However, there is currently a trend to consolidate these disparate opinions. This is, unfortunately, much like herding cats!
The general course of action is to create an authoritative source to manage the state of each shared collection of data and distribute a recent version to those requesting it. Disparate applications, for instance, are evolving to receive descriptions of the shared data from the customer manager service. In an attempt to align this data, the schema for each customer includes enterprise wide standard fields and includes extensions used by special applications. However, nothing can align perfectly across an enterprise. Also, some of the interested parties have their own extensions, which should be managed by the authoritative source. This presents yet another challenge.
Requesting Changes to Shared Data
Sometimes a service other than the authoritative service wants to see changes made to the contents of the shared collection of data. Because this service cannot directly update the contents itself, a request is sent to the owner of the data. This allows the authoritative service to manage the data. These requests must be business operations that are represented as service operators across services. Only the supported business functionality that is desired to export from the authoritative service will be implemented as operators. If the authoritative owner of the data agrees to make a change, the change is subject to the business logic enforcement of the authoritative owner of the shared collection. There are also situations when a new version of the entire shared collection of data is transmitted to the partner services instead of a few changes. Sending a copy of the entire customer database to the partner systems is one example.
Note: No matter how minor or major, any changes made must be considered carefully since changes to shared collections of data have business side effects that must be implemented by the authoritative service.
What about Optimistic Concurrency Control?
Optimistic concurrency control refers to a scheme in which a reader makes changes to its copy of the data and then proposes these changes to the authoritative service. Along with the proposed changes, the reader sends its original view of the data back to the authoritative service. If the stored data still matches that original view, the authoritative service makes the proposed changes.
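As a minimal sketch of the mechanism just described, the following Python fragment accepts a proposed record only if the caller's original view still matches what is stored. The names `CustomerRecord`, `RecordStore`, and `write_if_unchanged` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomerRecord:
    customer_id: str
    address: str


class RecordStore:
    """Hypothetical store that applies a write only if the original view is intact."""

    def __init__(self):
        self._records = {}

    def put(self, record):
        self._records[record.customer_id] = record

    def read(self, customer_id):
        return self._records[customer_id]

    def write_if_unchanged(self, original, proposed):
        """Return True and apply the change, or return False on a collision."""
        current = self._records.get(original.customer_id)
        if current != original:
            return False  # someone else changed the record since it was read
        self._records[proposed.customer_id] = proposed
        return True
```

A second writer holding a stale view is simply refused and must re-read before trying again. Note that the store applies whatever record the caller proposes, which is exactly the trust assumption examined in the rest of this section.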
Sometimes, it is proposed that services use optimistic concurrency control across boundaries. This makes several assumptions:
- The outside service is allowed to read the data. However, privacy issues and encapsulation issues may make this situation problematic.
- The owner of the data trusts the business logic of the outside service. After all, some logic must have executed on the outside to create the image of the data that is being proposed for write.
Updates across service boundaries have little to do with optimistic concurrency control. Instead, they have a lot to do with the logic surrounding it. It is all about trust and who can make changes to the data. Services by nature are distrusting so they will not allow foreign business logic to create data to be stored in the local database. Only the local business logic can create data to be stored in the local database. Therefore, it is important for partner services to be aware that they cannot just change the data sent to them, but should ask for the change to be completed by the authoritative owner of the data.
Example: Updating a Customer's Address
Many times, people offer a counter example to the discussion above wherein they propose that the management of a customer record should be done using optimistic concurrency control. Let's consider why this is problematic across service boundaries.
Figure 11. Versions of a shared collection and a request for update
One may initially believe that it is appropriate to allow the salesperson to update some fields of the customer record directly and submit these to the authoritative service as a "write", assuming there is no optimistic concurrency collision. This is problematic for a couple of related reasons. First, the authoritative owner of the data does not want to yield control of the ability to change the data to some other service's business logic. It wants to be responsible for the integrity of the data and, hence, wants its own business logic to be responsible for the change. Second, the change to the field may have business implications that need enforcement. For example, changing the address may have tax implications for the customer, may change the responsible salesperson, and may require that any in-flight shipments be redirected. Therefore, it is not enough just to change the field in the customer record. In summary, it is essential that interactions with the authoritative service be oriented around business functions such as "please update customer address."
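The contrast with a direct field write can be sketched as follows. This is an illustrative Python example, not a prescribed design: `CustomerManagerService`, `update_customer_address`, the region-to-salesperson mapping, and the shipment list are all assumptions made for the example. The point is only that the partner asks for a business function and the owner's logic applies the side effects.

```python
from dataclasses import dataclass, replace


@dataclass
class Customer:
    customer_id: str
    address: str
    tax_region: str
    salesperson: str


class CustomerManagerService:
    """Hypothetical authoritative owner of the customer collection."""

    def __init__(self, salesperson_by_region):
        self._salesperson_by_region = dict(salesperson_by_region)
        self._customers = {}
        self.redirected_shipments = []  # stands in for a shipping workflow

    def register(self, customer):
        self._customers[customer.customer_id] = replace(customer)

    def get_customer(self, customer_id):
        # Hand out a copy so callers cannot modify the stored record.
        return replace(self._customers[customer_id])

    def update_customer_address(self, customer_id, new_address, new_tax_region):
        """Business operation: the owner validates and applies all side effects."""
        if not new_address.strip():
            raise ValueError("address must not be empty")
        if new_tax_region not in self._salesperson_by_region:
            raise ValueError("unsupported tax region")
        customer = self._customers[customer_id]
        customer.address = new_address
        customer.tax_region = new_tax_region  # tax implications
        customer.salesperson = self._salesperson_by_region[new_tax_region]
        self.redirected_shipments.append((customer_id, new_address))
        return replace(customer)
```

Because callers only ever receive copies, changing a returned `Customer` has no effect on the authoritative state; the only path to a change is the business operation.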
Publishing Versioned Reference Data
To ensure reference data is not mixed up, the publisher first defines a name for the data. Version numbers are then added to each data update. Examples of versioned reference data involving operands, historic artifacts and shared collections follow:
- Operands: A price list dated Monday, July 19, 2004
- Historic artifacts: A request is made for Joe's February bank statement from two years ago
- Shared collections: A customer database dated Thursday with a 10am timestamp
Updated information is always versioned to distinguish between copies of data. When the information is ready to be sent, many transmission techniques are available like messaging or file copy. Currently, FTP is the most common transmission technique in use. Ultimately, it is not very important what technique is used to transmit information. What is important is the way operands, historic artifacts, and shared collections work with cross-service computing.
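One way to picture naming and versioning is the small Python sketch below. The `Publisher` and `ReferenceDataVersion` names, and the use of a simple increasing integer as the version, are assumptions for illustration; a date or timestamp would serve equally well.

```python
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class ReferenceDataVersion:
    name: str      # the name the publisher defined for this data
    version: int   # added to each update
    payload: MappingProxyType  # read-only view of the published contents


class Publisher:
    """Hypothetical publisher that never overwrites a published version."""

    def __init__(self, name):
        self.name = name
        self._versions = []

    def publish(self, payload):
        snapshot = ReferenceDataVersion(
            name=self.name,
            version=len(self._versions) + 1,
            payload=MappingProxyType(dict(payload)),  # private copy, read-only
        )
        self._versions.append(snapshot)
        return snapshot

    def get(self, version):
        return self._versions[version - 1]
```

A subscriber that holds the pair (name, version) always refers to the same contents, no matter how the copy reached it.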
Until now, discussion has focused on data on the outside – the transmission of messages across networks, the importance of immutable and stable data, and data publication across services. The next section focuses on data on the inside.
Messages Are Data
All services receive messages. These messages contain operators asking the service to perform a function. The function may be a business instruction or perhaps a product order. The function may also be to accept some incoming reference data.
Once a service receives messages, it records and commits them as data in a Structured Query Language (SQL) database. This ensures the data is stored and retrievable. Next, a transaction takes place that marks the incoming messages as consumed. As a part of the transaction, outgoing messages are queued in the local database for later transmission. Finally, the whole transaction is atomically committed. If the transaction terminates before the entire operation is committed, the incoming message will reappear in the queue and the transaction is retried. Because a transaction may abort, outgoing messages are never sent until the transaction has committed, which ensures that message transmission is atomic with the rest of the transaction. If, for instance, a message is sent and then the transaction is aborted, the receiver may still process the message even though the transaction that produced it failed. The atomicity of the transaction is violated in this situation because the transaction has not been completely undone.
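The sequence described above can be sketched with Python's standard `sqlite3` module. The table names (`incoming`, `outgoing`), the function names, and the `handler` and `send` callbacks are all hypothetical; the sketch only shows consumption and queuing sharing one transaction, with transmission happening after commit.

```python
import sqlite3


def open_service_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE incoming (id INTEGER PRIMARY KEY, body TEXT NOT NULL,
                               consumed INTEGER NOT NULL DEFAULT 0);
        CREATE TABLE outgoing (id INTEGER PRIMARY KEY, body TEXT NOT NULL,
                               sent INTEGER NOT NULL DEFAULT 0);
    """)
    return conn


def receive(conn, body):
    """Record and commit an incoming message as data."""
    with conn:
        cursor = conn.execute("INSERT INTO incoming (body) VALUES (?)", (body,))
    return cursor.lastrowid


def process(conn, message_id, handler):
    """Consume one message and queue its replies in a single transaction."""
    with conn:  # commits on success, rolls back if anything below raises
        row = conn.execute(
            "SELECT body FROM incoming WHERE id = ? AND consumed = 0",
            (message_id,),
        ).fetchone()
        if row is None:
            return False  # unknown or already consumed
        for reply in handler(row[0]):
            conn.execute("INSERT INTO outgoing (body) VALUES (?)", (reply,))
        conn.execute("UPDATE incoming SET consumed = 1 WHERE id = ?", (message_id,))
    return True


def transmit_committed(conn, send):
    """Send only replies that a committed transaction has queued."""
    rows = conn.execute(
        "SELECT id, body FROM outgoing WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, body in rows:
        send(body)
        with conn:
            conn.execute("UPDATE outgoing SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

If `handler` fails partway through, the rollback removes any replies already queued and leaves the incoming message unconsumed, so it can be retried. In this sketch a crash between `send` and the `sent` update would cause a resend, so receivers would need to tolerate duplicates.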
In addition to the transactional support achieved by storing the incoming messages in the SQL database, there are business benefits. The contents of the messages can be easily retrieved for different purposes such as audits and business intelligence analysis. As an added benefit, data in a SQL database allows for management and monitoring of the ongoing work to be based on SQL queries.
Kinds of Data Inside Services
Inside of services, three classes of data are found:
Table 2. Data inside services
|Reference Data from Other Services|
|Messages (Operators and Responses)|
|Service-Private Data|
The following sections examine the characteristics and uses of the three classes of data mentioned in the above table.
Service-private data resides inside a service and is protected by the business logic of that service. Its contents are not readily known or available to anyone besides the service in which it lives. The only way to read service-private data from outside the service is by calls through the business logic, which controls what data is exposed.
Typically, this data is heavily processed to yield a particular result – highly specific information. Consider a bank's ATM. Customers are not aware of the bank's backend data from their account management system. Customers may only see their account balance or their last 10 banking transactions. This information has all been heavily processed by the bank's business logic. The only way to change the service-private data from the outside is by submitting a business operation to the service. For instance, altering bank data may occur when a customer performs a transaction such as a withdrawal. This transaction changes the bank's backend service-private data indirectly through the business operation of the withdrawal.
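The ATM example might be sketched like this in Python. `AccountService` and its methods are hypothetical names, and amounts are kept in cents purely for simplicity. The ledger is private; outsiders see only processed views and can change state only through a business operation.

```python
class AccountService:
    """Hypothetical bank service whose ledger is service-private."""

    def __init__(self, opening_balance_cents):
        # Private backend data: never handed out directly.
        self._ledger = [("open", opening_balance_cents)]

    def balance(self):
        """Processed view: a single number derived from the ledger."""
        return sum(amount for _, amount in self._ledger)

    def last_transactions(self, limit=10):
        """Processed view: an immutable copy of the most recent entries."""
        return tuple(self._ledger[-limit:])

    def withdraw(self, amount_cents):
        """Business operation: the only way to change the ledger from outside."""
        if amount_cents <= 0:
            raise ValueError("withdrawal must be positive")
        if amount_cents > self.balance():
            raise ValueError("insufficient funds")
        self._ledger.append(("withdrawal", -amount_cents))
        return self.balance()
```

Python does not enforce privacy of `_ledger`; the underscore is a convention. In a real service the boundary is enforced by the fact that callers can reach the data only through messages.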
Replication of Reference Data
A single publisher commonly sends reference data to many subscribers. Since reference data is both immutable and versioned, it is easy to replicate across different services. It is also easy to keep many copies of the data since no semantic ambiguity exists.
Reference data may also be imported into a service. As it is read into a service, it may be processed, reformatted, and indexed. Data stored in the service remains immutable. While the syntax and internal representation of the reference data may be changed to suit the needs of the service, the semantics remain intact and are considered a representation of the same immutable data.
Note: Reference data stored inside a subscribing service is a replica. However, there is no real distinction between replicas and copies of the data because they are all immutable.
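As an illustration of importing reference data, the Python sketch below reads a published price list, changes its internal representation (decimal strings become integer cents, and rows are indexed by SKU), and exposes the result read-only. The row layout and the name `import_price_list` are assumptions made for the example.

```python
from decimal import Decimal
from types import MappingProxyType


def import_price_list(published_rows):
    """Reformat and index published rows; the meaning of each price is unchanged."""
    index = {}
    for row in published_rows:
        # Same price, different internal representation: "9.99" becomes 999 cents.
        index[row["sku"]] = int(Decimal(row["price"]) * 100)
    return MappingProxyType(index)  # the local replica stays immutable
```

For example, `import_price_list([{"sku": "A1", "price": "9.99"}])["A1"]` evaluates to `999`, and any attempt to assign into the returned mapping raises `TypeError`.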
Figure 12. Transaction use of data on the inside
Significant differences exist between the temporal interpretation of data inside services and outside services. This section examines the different perspectives and uses of time in a service oriented architecture.
Life in the Now: Transactions and Inside Data
Transactional systems have worked hard to give the application developer the sense of being alone in manipulating the database. Serializability ensures that, from the point of view of any given transaction, every other transaction that interacts with the database appears to occur either entirely before or entirely after it.
It is reasonable to consider that transactions, as they are executing, live in the "now". Due to serializability, there is no possibility that anything else is happening concurrently from the application's perspective. Although subject to security, the application is able to examine and modify almost anything in the database viewing the most up-to-date information possible. This leads to a perspective that the application lives firmly in the "now".
Inside a service, the business logic of the service is dealing with only the latest-and-greatest view of the service's private data. In this fashion, it is said that the service logic is deeply tied to the "now" of the service and its service-private data.
Life in the World of Then: Data on the Outside
When operators are sent in messages, they are requests for business functions to be performed. In effect, the sender is hoping the service will perform the business function in the not-too-distant future. As the transaction commits on the sending node, it is perfectly clear the requested operation has not yet occurred, but will happen in the near future.
Consider the various kinds of reference data sent between services and the temporal semantics of the data. As discussed above, reference data falls into three broad categories: operands, historic artifacts, and shared collections.
Operands describe the data used to form an operator request for a business function. Just like a department store catalog, operands are typically valid for a specified amount of time. Outside of that time, the operands are likely to be invalid if used in the creation of an operator. While this range of time is likely to include the present, it clearly is not about the immediacy of the "now". It is a supported range of "then".
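A supported range of "then" can be made explicit in the data itself. In the hedged Python sketch below, a price list carries its validity window, and an operator request may be built from it only within that window. The `PriceList` fields, the `build_order_request` function, and the end date used in the usage note are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PriceList:
    version: str
    valid_from: date
    valid_until: date       # inclusive end of the supported range
    prices_cents: tuple     # ((sku, price_in_cents), ...)

    def is_valid_on(self, day):
        return self.valid_from <= day <= self.valid_until


def build_order_request(price_list, sku, quantity, today):
    """Form an operator request from operands, refusing expired operands."""
    if not price_list.is_valid_on(today):
        raise ValueError(f"price list {price_list.version} is not valid on {today}")
    unit_price = dict(price_list.prices_cents)[sku]
    return {
        "operator": "place_order",
        "price_list_version": price_list.version,
        "sku": sku,
        "quantity": quantity,
        "unit_price_cents": unit_price,
    }
```

With a list valid from July 19, 2004 through an assumed end date of January 18, 2005, a request built in September 2004 succeeds, while the same call made in February 2005 raises `ValueError`. Including the price list version in the request lets the receiving service tell which "then" the sender was working from.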
Historic artifacts describe what went on inside the publishing service in the past. There is no information about the use of historic artifacts dealing with the present or the future; it is only about the past.
Shared collections of data are also about the past. For example, the authoritative service may publish a view of the state of the enterprise's customers on a daily basis. Partner services do their work based on last night's view of the state of the customers. Again, the perspective is one dealing with "then".
Services: Dealing with Now and Then
One of the biggest challenges in the transition to Service Oriented Architectures is getting programmers to accept that they have no choice but to understand both the "then" of data that has arrived from partner services, via the outside, and the "now" inside of the service itself.
The semantics of the business functions provided by the service must alleviate this tension. By abstracting and loosening the behavior supported by services in their messaging contracts for business functionality, it is possible to cope with the difference between "then" and "now". This is exactly what has been seen for generations in cross-business work. For example, the department store catalog includes many different products and guarantees the price as shown in the catalog for more than six months. Business logic and the intelligent design of the service contracts can cope with the impedance between "then" and "now".
The remainder of the paper focuses on the representations of data on the inside and outside. It also analyzes and compares three models, Extensible Markup Language (XML), Structured Query Language (SQL), and objects. The information presented ultimately shows seminal differences between data inside a service and data sent outside the service. Also revealed is that the strength of each model in one use becomes its weakness in another use.
Inside and Outside Data
Data can be broken down into two main classes, data on the inside and data on the outside. Data on the outside is sent across service boundaries in the form of messages and includes business-function and reference data, which must be understood by both the sender and the receiver. The data is always immutable and can be versioned. Since it travels between services, versions are likely stored as replicas by the receiving services. Conversely, data on the inside lives in the service and is rarely sent out of the service. If it is transmitted outside, it must be processed by the business logic. This class of data is by nature private to the service and encapsulated by code. Below is a table providing additional information on both classes of data:
Table 3. Data inside vs. data outside
The table identifies data on the outside as always immutable. This makes the notion of normalization unimportant. Yet, immutability is not enough because data on the outside must also be stable so its meaning is clearly understood across different times and locations. Versioning and time stamping are recommended to make data stable. Although stability is not a concern for data on the inside, it can be stable. For instance, the data can include stable items if it refers to data being sent to the outside. Since data on the inside is not necessarily immutable, it should be normalized to prevent update anomalies.
Bounded and Unbounded Data Representations
Let's consider the ways in which SQL and XML represent their data and the implications on the scope of that representation.
All data stored inside a SQL database lives within the bounds of the database. Therefore, value-based comparisons are only meaningful if the domain of the values is the same. Interpreting the contents of a SQL database outside of the database itself is impossible.
The scope of the transaction used to modify the database defines the temporal bound for relating values contained inside the database. For instance, if the transaction is committed and a copy of the data is sent outside the service, the underlying values may change before the copy is used. The interpretation of the values after they are unlocked is subject to interesting semantic anomalies and is usually avoided.
Lastly, relational data also has a tightly managed schema. The schema is modifiable through Data Definition Language (DDL) within a transaction. DDL is usually tightly associated with the SQL database. Indeed, as soon as one transaction commits, the next may change the DDL that made the previous transaction meaningful! This extremely flexible behavior is correctly defined only within the bounds of the database.
Unlike SQL, XML documents and their messages are unbounded because of their open schema. XML's schema definition allows independent creation of the schema, which lets people design their own customized markup languages, or borrow portions of other schemas. Moreover, each schema is identified with a Uniform Resource Identifier (URI), indicating an immutable version of the meta-data. As a message is sent, the URI for its meta-data unambiguously specifies the schema for interpreting the message. This meta-data remains invariant across space and time.
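To make this concrete, the following Python sketch uses the standard `xml.etree.ElementTree` module to read the namespace URI of a message's root element and interpret the message according to that schema version. The two URIs and both message layouts are invented for the example; they only illustrate that the URI alone tells the receiver how to read the message.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema identifiers; each names one immutable schema version.
ORDER_V1 = "urn:example:order:v1"
ORDER_V2 = "urn:example:order:v2"


def parse_order(xml_text):
    """Interpret an order message according to the schema its URI names."""
    root = ET.fromstring(xml_text)
    if not root.tag.startswith("{"):
        raise ValueError("message does not identify its schema")
    # ElementTree renders qualified tags as "{namespace-uri}local-name".
    namespace = root.tag[1:].partition("}")[0]
    if namespace == ORDER_V1:
        sku = root.findtext(f"{{{ORDER_V1}}}sku")
        quantity = int(root.findtext(f"{{{ORDER_V1}}}qty"))
    elif namespace == ORDER_V2:
        item = root.find(f"{{{ORDER_V2}}}item")
        sku = item.get("sku")
        quantity = int(item.get("quantity"))
    else:
        raise ValueError(f"unknown schema: {namespace}")
    return {"sku": sku, "quantity": quantity}
```

A version 1 message and a version 2 message that describe the same order yield the same result, and a message written against version 1 years ago is still read the same way today.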
Another matter that sets XML and SQL apart is that XML uses references and not values to connect information. Connections between sub-trees and across documents are done through references, which are implemented as URIs. When implemented as URIs, references are an unambiguous mechanism for connecting data together that remains intact in the face of schema changes, location changes, and temporal changes.
Because XML information lives in the "then" and is always referring to the past or the future, it can be unambiguously interpreted anytime. The notion of "now" is difficult across systems that do not share atomic transactions. This ability to specify "then" in the semantics of the data makes it interpretable both anytime and anywhere. SQL, living in the intimacy of the "now" within the SQL database, cannot be accurately understood except in the "now" and inside the bounds of the SQL database or its surrogates.
Characteristics of Inside and Outside Data
The next section takes a more in-depth look at data on the inside and data on the outside and further differentiates between the two by comparing message data, reference data, and service-private data against numerous characteristics.
Table 4. Characteristics of data on the inside and outside
|Characteristic|Message Data (Outside)|Reference Data (Outside)|Service-Private Data (Inside)|
|Requires Open Schema for Interoperability?|Yes|Yes|No|
|Represent in XML?|Yes|Yes|No|
|Long-Lived Evolving Data with Evolving Schema?|No|No: Immutable Versions|Yes|
|Store in SQL?|Yes: Copy of XML Stored in SQL|Yes: Copy of XML Stored in SQL|Yes|
Message data and reference data, both data on the outside, are similar in that they are both immutable with an open schema, and best represented in XML. Service-private data, on the other hand, promotes encapsulation, which is the opposite of an interoperable open schema. Even schema changes in different ways with data on the inside and data on the outside. For instance, when the message schema is changed, explicit versioning of the schema is used. The evolution of data on the inside may occur vibrantly as DDL changes the current state of the schema. Lastly, one reason both incoming and outgoing messages are shredded into SQL is that business intelligence is an important use of all data regardless of its location, inside or outside. Usually, the amount of XML shredded for queries is based on the amount someone wants analyzed. The extensibility of XML means that shredding is sometimes easy and sometimes more challenging, as data may arrive that does not map to the schema.
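A rough sketch of shredding, using Python's standard `sqlite3` and `xml.etree.ElementTree` modules, is shown below. The `messages` table, its columns, and the message layout are assumptions for the example. The full XML is stored as a copy, and only the fields someone wants to analyze are shredded into columns; elements that do not map to a column remain available in the stored copy.

```python
import sqlite3
import xml.etree.ElementTree as ET


def create_store():
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE messages (
            id INTEGER PRIMARY KEY,
            raw_xml TEXT NOT NULL,   -- copy of the message as it arrived
            customer_id TEXT,        -- shredded for queries
            total_cents INTEGER      -- shredded for queries
        )
    """)
    return conn


def store_message(conn, xml_text):
    """Keep the whole message and shred the fields used for analysis."""
    root = ET.fromstring(xml_text)
    total_text = root.findtext("total_cents")
    total = int(total_text) if total_text is not None else None
    with conn:
        conn.execute(
            "INSERT INTO messages (raw_xml, customer_id, total_cents) "
            "VALUES (?, ?, ?)",
            (xml_text, root.findtext("customer_id"), total),
        )
```

Once shredded, ordinary SQL such as `SELECT SUM(total_cents) FROM messages WHERE customer_id = 'c1'` answers business intelligence questions without parsing XML again.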
Once the different kinds of data are identified and their functions understood, most people start to wonder what technologies to use. The following section examines three models, XML, SQL, and objects, and their representation of data.
What about Objects?
Most people know about the power of object-orientation to facilitate software engineering. It, too, has a place in the battle for the representation of data. People have seen object persistence and object-oriented databases become popular and, sometimes, fail to live up to expectations. Still, there are important forces that encourage the use of object-orientation as a means of representing data.
The biggest advantage objects have over SQL and XML is that they provide encapsulation. The data being stored is hidden from the user of the data, and only the behavior of the methods provided by the object is visible. This is very similar to the way services expose business functionality through the messages defined by their schema and their contracts. The seminal difference between objects and services is that services never share data except as reference data. This is a much looser relationship than exists across objects.
Still, the use of objects to implement services is highly recommended. It is difficult to believe that anyone would want to implement a service without the benefits of an object-oriented environment! If a service is implemented using objects, it is opaque to all of its partner services.
The ruling triumvirate: XML, SQL, and objects
It is interesting to know that each model's strength is simultaneously its weakness. For instance, it is the independence of schema definition, coupled with a reference-based hierarchical representation and the temporal interpretation of the data, that makes XML wonderful for communicating across services. These immutable messages are easily created and interpreted across different services. These features derive directly from the unbounded nature of XML. Still, these features of XML for the "outside" are debilitating weaknesses on the inside. It is problematic to query, shape, and update XML with the richness available to normalized data, representing the "now", and stored in SQL. These weaknesses also derive directly from the unbounded nature of XML.
Because of its characteristics, SQL makes a great tool for representing data on the inside of services. These strengths are inextricably linked to the bounded nature of SQL in both space and time, which makes it fantastic for representing data on the "inside". Unlike XML, SQL has strong querying capabilities. SQL can compare almost anything with anything else within the bounds of the database. Because of SQL's bounded nature, however, it is incapable of the strengths of XML in the "outside". SQL does not offer independent definition of schema, as it depends heavily on a centralized and tightly coupled DDL.
Neither XML nor SQL offers encapsulation. In XML, encapsulation is impossible because of its open schema. SQL does not enforce encapsulation either, since data can be changed directly through UPDATE DML. Unlike XML and SQL, objects and components thrive on encapsulation. By its very nature, encapsulation prevents the arbitrary comparison of any data since the data is not visible. Therefore, it is impossible to perform queries. Extensibility and independent definition are also impossible in this model since encapsulation implies that the schema is concealed.
In summary, XML thrives in the world of communicating requests, responses, and reference data between services. It provides all of the functionality, scalability, and granularity required by messages. In terms of storing data, the SQL database is a leader, offering many outstanding benefits such as storing incoming and outgoing messages. Its capabilities are further bolstered when utilized for audits, compliance, or business intelligence. When building a service, objects are recommended because encapsulation facilitates the construction of software.
Figure 13. XML, SQL, and objects working together
In the end, this information is not presented to lobby for one model over another. Instead, this information is offered to illustrate a fascinating fact – each model's strength is at the same time its weakness. It is important to recognize, however, all three models are critical in a Service Oriented Architecture.
This paper examined an intricate part of SOA: the difference between data inside a service and data sent outside the service boundary. The roles of, and relationships between, services, messages, and different technology models were explored to illustrate the difference between the two kinds of data.
Services are connected only by messages; otherwise, they are independent and behave differently from each other. The messages sent between services describe a business function and contain operands that commonly express parameters or options for the operation. This is a much freer and less intimate relationship than in traditional distributed systems. Because services are independent, atomic database transactions are not shared across service boundaries. The uncertainty of joint decisions is handled by the business logic of the interacting services.
Messages are also different. Once a message is outside the service boundary, it must be immutable. In addition, special attention must be given to the stability of the data. Stability ensures the data is understood across space and time.
Reference data plays an important role in messaging. It refers to data published across service boundaries. Operands, historic artifacts, and shared collections are all types of reference data. Operands are parameters or options used to create operator requests sent out for processing. Historic artifacts describe past events that took place inside the publishing service. Shared collections of data are used by multiple services. They are also constantly evolving. As a result, partner services usually only have access to a recent version of the data.
When discussing the representation of data inside and outside of services, SQL, XML and Objects all have a worthy place. Interestingly, the essence of what makes one of these models strong in one area of use also makes it weak in another area of use. This is the reason for the longevity of the differences across the communities of specialists in data representation.
Finally, data outside a service is different from data inside a service. Data on the inside is described as living in the "now." The data is usually private to the service and encapsulated by service code. On the other hand, data on the outside lives in the past. It is passed in messages and understood by both the sender and receiver.
About the author
Pat Helland has 25 years of experience in the software industry and has been an architect at Microsoft since 1994. He has worked for more than 20 years in database, transaction processing, and distributed systems, as well as fault-tolerant and scalable systems. Pat worked at Tandem Computers, where he designed TMF (Transaction Monitoring Facility). He was one of the founders of the team that implemented and shipped Microsoft Transaction Server (MTS), now COM+. Pat has recently focused his thinking on loosely coupled application environments. Pat can be reached at firstname.lastname@example.org.