Dealing with Concurrency: Designing Interaction Between Services and Their Agents

Maarten Mullender
Microsoft Corporation

June 2004

Applies to:

   Microsoft Visual Studio .NET

Summary: Learn design principles for creating and working with software services and service interactions that give you several ways to deal with the challenge of keeping data valid when consumers retrieve and work with that data across a network while the service carries on with other work. (21 printed pages)

Contents

Introduction
Buckets of Local Data
Update Strategies
Conflict Resolution
Conclusion

Introduction

In this article I will discuss some of the challenges created by working with services. I define software services as discrete units of application logic that expose message-based interfaces suitable for being accessed across a network. Consumers (which can be client applications or other services) retrieve data from services and work with that data while the service carries on with other work, thus possibly invalidating that data. I will highlight some of the design principles that you can use to deal with such challenges.

I will not try to provide guidance on building offline applications, nor on preparing the client for offline use by pre-populating the local cache. Rather, I will concentrate on designing service interactions.

The Service Model

Typically, services provide both the business logic and the state management relevant to the problem they are designed to solve. When designing services, the goal is to effectively encapsulate the logic and data associated with real-world processes, while making intelligent choices about what to include and what to implement as separate services.

Services are necessarily very protective of the state that they manage, taking great care to authorize both read and write access, and to validate updates against integrity rules. Services are strongholds for the state they manage and are the definitive authorities on how to manipulate that state. They don't allow direct access to their data, nor will they expose their complete internal state. Instead, they provide copies of the data that they maintain. Services may be said to maintain a "healthy distrust" of outsiders seeking access.

Business rules govern state manipulation. They are the relatively stable algorithms associated with state, such as the manner in which an invoice is totaled from an item list, and are typically implemented as application logic.

Policies govern services. They are less static than business rules and may be regional or customer-specific. For example, certain favored customers might get discounts on goods and services. A very different example is a policy that mandates using secure sockets layer (SSL) to access a service. Policies are typically driven from lookup tables at run time (though there needs to be application logic to look up and apply the policies).

So a more complete definition of services might be, "Services are network-capable software units that implement logic, manage state, communicate via messages, and are governed by policy."

For an overview of services visit http://msdn.microsoft.com/library/en-us/dnea/html/EaAppConServices.asp.

Concepts

This paper relies on concepts that you may also encounter in other discussions. I will define these to facilitate cross-referencing.

  • Distrust and dependence

    Organizations do not share more information than needed for a smooth business process. Typically, organizations don't give out information unless there is a good reason to provide it and the requestor has enough credibility. I will not hand out my credit card number unless I am buying something and I trust the seller.

    Likewise, in service-oriented design it is good to minimize the amount of information given. More information means more dependence, and that means more expensive change. More information also means more exposure, and that means more risk: the risk that the information may be misused, and the risk that the information may become stale.

    On the other hand, less information may mean less user-friendliness.

  • Views and staleness

    The service never exposes all of its internal state. If a client requests information, about a product for instance, the service will provide a view on that entity. Views give limited insight into the service's state. Still, the views form the basis for the communication with the service.

    As soon as the view is populated and the database is unlocked, other processes may change the underlying data. Any data that lives outside the service's boundary may be based on outdated information and be stale. The local view was retrieved somewhere in the past and may even be "inconsistent" in itself. For instance, if I retrieve a General Ledger account in one call and its itemization in the next, new items may have been added during the elapsed time and the sum of the items would be different from the year-to-date total in the account.

    Figure 1. View data is not current

  • Business actions, postings

    Modeling service interactions as postings (subtraction or addition) rather than as absolute updates reduces the chance that one client has to wait for another, or that one client's update is refused because another client's update happened between data retrieval and update.

    Also, inserting new information is an operation that has few, if any, concurrency issues.

    Think posting and insertion, where possible.

    Modeling the request as a subtraction, addition, or insertion will not always be possible, but only seldom can you simply change master data. Most of the time a master data change is the result of a business action that carries side effects. Such an update is not a simple optimistic update.

  • Re-orderable

    Re-orderable (also sometimes called commutative) means that two or more requests can be sent in any order, leading to the same result. From a single client, the message sequence may be manageable; however, if the requests stem from multiple clients, being re-orderable would imply that:

    • If I send a request and you send a request, it doesn't matter which one is handled first, we would still get the same result. Our requests can be reordered or rearranged. If you and I retrieve data and I request an update before you do, the result may be different from when your request is handled first. If we both order widgets, and there are enough, we have no problem (the requests are re-orderable). If we both try to order that one antique, however, we do have a problem (the requests were not re-orderable).
    • This is another way of saying that we can work concurrently; we do not influence each other's preconditions.

    Some problems are inherently not re-orderable; the challenge is to design so as not to add to the problem.

  • Repeatability or idempotency

    Repeated handling of the same request should not lead to problems. Making sure that this is designed into the service simplifies the client and communication infrastructure design. Making every request uniquely identifiable helps create a robust service. A journal is a natural solution, which combines multiple logically connected postings in a single officially numbered, and thus uniquely identifiable, business document.

  • Stable data

    Stable data is data that doesn't change its meaning.

    When creating an order, we may use the pricelist of last Monday, at 8 AM. If the system only publishes pricelists once a day and keeps these pricelists around for an extended period, we may always check the order requests against the pricelist to verify whether the order should be denied based on outdated or incorrect price information. Whoever retrieves that pricelist with a given timestamp or validity period gets the exact same copy, and no matter when the pricelist is interpreted the product numbers on the pricelist will always refer to the same products. (This implies that the product numbers are not being recycled). The format of the pricelist may differ; I may get a paper copy or an electronic copy, I may get one with extended product descriptions or just with short product names, but the meaning and the interpretation remain the same.

  • Predictability

    Predictability refers to changes made to the local view that anticipate the changes the service will make after reconnecting. These changes are used as a convenience to the user or as a basis for further processing. Ideally, these changes are recognizable.

  • Reconciliation

    Reconciliation is the graceful handling of errors and exceptions that occur during journal processing, guided by the principle of least surprise combined with the principle of least user inconvenience (reduce duplication of tedious work).

Going Offline

When working offline, applications normally need data on the local machine in order to work. They need data to present information to the user and data to enable logic execution. They also need to store requests locally in order to process them when they establish a connection.

An e-mail system is a nice example: All the e-mail messages that I receive are copied to the laptop and I can read them offline; the e-mail message that I write is stored locally. After reconnecting, the e-mail folders will synchronize. I will receive new messages, and those I wrote will be sent. Because updates may occur on both sides, however, the e-mail system cannot simply change the data on both sides and synchronize the data; it needs to keep track of the changes (they must be recognizable) and merge them, as opposed to merging the data. As a simple, almost classical, example, an e-mail message that is deleted on one side should be deleted on the other side and not recreated. Simple replication algorithms that only look at the current state and do not keep track of changes will never delete anything, because they cannot tell which is most recent.

Instead, in our example, the e-mail system on the laptop maintains changes in a change log (journal) and propagates them to the server, and then the server propagates its changes to the laptop. This too may be done using a change log. You can implement this either as change log per client, or a central change log for which the client keeps track of how much was synchronized. Assuming the service owns the data, there are additional options available, such as synchronizing the data views and overwriting the local views.

On the local system, we may use the data for several purposes. We can distinguish between:

  • Data that is only referenced and that others are unlikely to change in the service in the meantime, such as pricelists, tax tables, product descriptions, and customer credit limits.
  • Data that is only referenced, but others may change in the service, either indirectly, such as stock on hand, or directly, such as other people's free/busy time.
  • Data that is being manipulated, such as orders, the Inbox and Outbox, address information, and customer order volume.
  • Data that is being created, such as new e-mail messages and new orders.

All of these may create problems when synchronizing the local state with the central state: pricelists may change centrally while working offline; the customer order volume may change on both sides; or a new order may be created on both sides. There are various strategies to deal with possible conflict between the changes on both sides, and we will discuss these in more detail later in the article. Which method we choose depends on many things: our goals, the volatility of the data, the impact of conflict, and other things I will discuss below. Once we understand the nature of the problem, we may:

  • Avoid change on either side.
  • Reduce the dependencies between the two sides and thus reduce the chance of a conflict.
  • Correct the situation automatically after a conflict occurs, where possible.
  • Reduce the annoyance caused by a conflict.
  • Require human intervention.

I will mainly focus on the first two and briefly discuss the others.

Going Online

If we take an application, running in a single process on a single machine for a single user, we do not have any concurrency issues. As soon as we want to share the data between users, however, we have to think about concurrency issues. I get some data from the database, you get some data, and we both change it; now what do we do?

Take the scenario of a company that has call centers around the world to provide around-the-clock support. If these call centers use central services to store the information, such as call history and all types of support information, there may be a large latency in the communication. The call center application is online, but if the user has to wait for one action to finish communicating with the central service before they can initiate the next, calls may extend longer than absolutely necessary. Every second that can be shaved off the average call time constitutes huge cost savings. To achieve this, the application requires some parallelism. This parallelism, however, requires local data that will be referenced, manipulated, or created. This results in many of the same problems as in the offline case.

In any business-to-business (B2B) or enterprise application integration (EAI) scenario, you have to deal with views on data retrieved from a service. Regardless of whether that data was retrieved a few milliseconds or a week ago, there is always the chance that the service has changed the data in the meantime (Figure 1). The probability of change is not necessarily a function of the time the data has been outside the service. The chance that the stock price changed over the last few milliseconds may still be higher than the chance that my company's address changed during the last week. The chance always exists, however, and it does not matter whether the view was stored on disk, just shown on the screen, or kept in memory. As soon as the copy of the data is under way from within the service to the outside world, you have to think about concurrency issues.

I argue that online work and offline work face many of the same challenges, but that the boundary conditions differ and we can therefore make different tradeoffs. In an offline scenario one is more restricted in the tradeoffs that can be made than in an online scenario. In an online scenario you can make a request to the service, have the user wait until the response arrives, and only then allow the next activity, so that the application knows whether the previous action was successful. In an offline scenario this is hardly possible.

In an online scenario, you may also allow the user to take the next step while waiting for the response on the previous request. In that case, you have to deal with exceptions that arrive after the user started or even finished subsequent steps. This is the only way you can work in an offline scenario, but the exceptions will have to be dealt with much later than in the online case, possibly leading to different forms of user interaction. For example, you may have to show a list of synchronization errors to resolve, whereas in the online mode you could do this right on the original screen.

I would also argue that, for the solution designer, there is neither absolute online nor absolute offline, but that there is a complete continuum from always connected with a fast connection, through connected with a slow line, to sometimes connected. Therefore, an online application often has to deal with offline aspects, and an offline application is sometimes used online.

Chance and Impact

You can try to reduce the chance of conflict, or try to reduce the impact of conflict, or try to do both. The ideal situations would be:

  • There is zero chance that a request will fail because of changes made by others. Then we would not care about the impact of such a failure.

    Or

  • There is no impact of such a conflict for the user and then we would not care about the chance.

The side effects of such "ideal" solutions may far outweigh their benefits. For instance, if I lock out all other users of the system, I will have no concurrency issues. This is only rarely the best thing to do.

Some requests are naturally without concurrency issues. If I send a request for information to the service, there is no chance that changes made by others will block my request. If I send the request five minutes later, I may get different data; I will just get the information the service had at the time it processed my request.

In other cases, we have options. For example, if I send out a document for review and both the reviewer and I change the document, I have a concurrency issue. In this case:

  • I can reduce the chance of conflict by providing a check-in/check-out mechanism.

    Or

  • I can reduce the impact by providing a merge capability. This could reduce the impact of this issue enough that I could even see this as a feature.

    Or

  • I can reduce both of these by showing others that someone is editing the document, and still provide merging capabilities.

Figure 2. How impact and probability affect conflict

Services and Service Agents

Interacting with a service and keeping your own state synchronized with the service's state is difficult. Not only is it difficult, it is a problem faced by every application that wants to interact with a service, whether the application presents a user interface on a notebook or is itself another service interacting with the service, and whether the local state is persistent or kept in memory. In all of these cases, the logic to interact with a service can be abstracted into a service agent.

A service agent:

  • Facilitates interaction with a service.
  • Facilitates connectivity transparency.

There are no constraints on where to use such a service agent. For example, you may use multiple service agents to interact with the same service. It makes sense in EAI and B2B scenarios to have service agents implement the interaction, and thus it makes sense to have two service agents implement the two sides of a conversation.

A Service Agent Facilitates Interaction with a Service

Service agents may help in dealing with the concurrency issues discussed in this article, and with managing transactional integrity between requests. They may also help in dealing with dependencies between requests in offline scenarios.

Clients may have different needs. One client may need some caching capabilities for performance reasons but, because it only supplies information, it does not need to deal with concurrency issues. A second client may need to support offline scenarios, and therefore needs to manage more local state on behalf of the user's interaction. The first does not need to care much about connectivity; the second definitely needs to be aware of connectedness. Since service agents facilitate the needs of client applications, service agents that are used in different scenarios must be tuned differently. In different scenarios you may want to use different implementations to interact with the same service, even for the same type of conversation. The functionality provided by a service agent depends on the needs of the client, as well as on the service that it accesses; using an application on a cell phone differs from using the same application on a desktop, for instance, because the two have completely different cache capacities and connection levels to design for.

Services typically interact for many purposes. They have conversations with various other services around various topics. The human resources system may communicate with the financial system around payments, with the benefits system, and with the procurement system. The order management system will communicate with one set of systems to receive orders, and with various other systems to fulfill orders. These conversations can be modeled in contracts, and service agents can help in facilitating the interaction through such a contract, or possibly through multiple such contracts. That is, there may be various service agents that each facilitate a certain portion of the interaction; one implementation of a service agent may focus on discussing new orders with the order management system, another may focus on discussing product availability, and yet another may focus on discussing business information for high-level decision making. Each of these deals with specific aspects of interacting with the order management service.

Service agents can help the client formulate correct requests by doing some local checking before sending the request to the service; they may make it easier to formulate requests by offering a better-suited interface; and they may make it easier to encode and transmit the request and to handle the response. The service agent may even make authentication and encryption transparent to the client.
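
To make this concrete, the following sketch outlines what such a service agent might look like in code. It is a minimal illustration with hypothetical names (OrderRequest, IOrderServiceAgent); the article does not prescribe a particular interface.

    using System;
    using System.Threading.Tasks;

    // Hypothetical request type; the unique RequestId supports repeatability.
    public sealed class OrderRequest
    {
        public Guid RequestId { get; } = Guid.NewGuid();
        public string ProductNumber { get; set; }
        public int Quantity { get; set; }
    }

    // A service agent fronting an order service: it checks the request locally
    // against cached reference data before sending it, and hides encoding,
    // transport, authentication, and encryption from the client application.
    public interface IOrderServiceAgent
    {
        bool TryValidate(OrderRequest request, out string error);
        Task<bool> SubmitAsync(OrderRequest request);
    }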

A Service Agent Facilitates Transparency of Connectivity

Service agents may be used in B2B scenarios, and even in EAI scenarios, where connections may be slow or intermittent; additionally, they may be used by applications running on notebooks and on mobile devices that have to deal with every possible connection state, ranging from fast connection, through shaky connection, to being completely disconnected. One of the service agent's goals is to deal with irregular connectivity and provide a consistent programming model to its client.

Such a programming model should enable the application to execute the exact same business logic regardless of the connection state. This does not mean that the application should never know what the state is, however. In some cases, the application may offer additional functionality when connected. A solution may check out documents automatically when online and allow editing when offline, but prohibit the editing of documents that have not been checked out. In other cases the application may choose to deal with incomplete information. For example, the schedule of some participants may not be available when scheduling a meeting, but for others it might be cached, with the risk of being stale.

The service agent has to manage the views on the data it obtains from the service; it has to manage the data's validity, and it has to deal with the case in which the central data was changed during the time the client application was using it. In both the online and offline cases you get information and use data to formulate requests. In one case it gets there quickly, while in the other it takes a little longer.

The application still has to deal with delayed exceptions: "Sorry, you cannot accept the meeting request; the meeting has been cancelled." The application has to deal with the resulting rejections and surface conflict resolutions to the user. The service agent merely deals with the timing of these.

Buckets of Local Data

Reference Data

The first part of the synchronization strategy is to avoid problems, and the simplest means is to avoid change. Much of the data retrieved from the service does not change on the client. It is merely reference data used to show descriptions (products) or to calculate (prices). When the original data is changed in the service, however, the synchronization may fail. We will reserve the term reference data for the copy of the data that was retrieved from the service. If that copy is changed locally, we will no longer refer to it as reference data. The service's designer will have to think about the reliability of the information that is handed out. The client's designer will have to think about how much to trust the information. For example, the client may be designed not to execute a stock order until it updates the pricing reference data.

As stated above, the simplest strategy is to prohibit any change of the price data on the server side. It is clear, however, that this, more often than not, will lead to some design challenges. In many cases it may be possible to change reference data overnight when nobody accesses the system, but that would mean that all users have to fetch a new copy in the morning. In some cases, like in many offline scenarios where the time of connection is unenforceable, this may not be practical.

A better solution may be to define how long the reference data will be valid and thus give it an expiry date. For example, "these prices are valid until January 1st," or "valid for this week only." This tells the client or service agent when to refresh the data, and it means that the service may have to deal with multiple copies of valid data. The design then also has to deal with requests that were created while the reference data was still valid, even though the data is no longer valid at the time of synchronization. The order that was entered on Friday, but only synchronized on Monday, has prices that were valid on that Friday. You have to answer the question of whether or not to trust the client application issuing the request, and how long after expiry to still allow such synchronization.

It is important in such a solution that the reference data will always be the same, whenever the data is retrieved. For instance, when the service agent wants to retrieve last Friday's prices, the service should either return nothing, or it should return the same data it did on Friday and on Saturday. It is not even so important whether the bits are the same, as long as they mean the same thing. It would be very, very bad if product xyz was a green widget on Friday and a red widget on Saturday. So, all the references in the data should keep referring to the same entities in the system. Likewise, it is important to be able to take the reference data and see when it was issued; to determine, say, is this Friday's pricelist?

Ideally, then, reference data would have the following properties:

  • It is used but not changed by the service agent.
  • It does not change its meaning (it is stable).
  • It is valid for a specified period (long enough to be useful).
  • It is uniquely identifiable.

Such data can easily be used without worrying too much about retrieving fresh copies of the data. This is ideal for optimizing bandwidth and for offline scenarios, as you know exactly whether the data is reliable. Throughout this document we'll refer to this as predictable reference data. Still, you may have to devise strategies for dealing with the time between expiry and when the reference data is refreshed, when you know that the data is no longer reliable and may thus be stale.
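
As an illustration, predictable reference data can be represented as a uniquely identified, read-only document with an explicit validity period. The sketch below uses hypothetical names (Pricelist, IsValidAt) and is only one possible shape for such data.

    using System;
    using System.Collections.Generic;

    // A pricelist as predictable reference data: uniquely identifiable, stable,
    // valid for a specified period, and never modified by the service agent.
    public sealed class Pricelist
    {
        public string PricelistId { get; set; }                 // e.g. "PRICES-2004-06-14"
        public DateTime ValidFrom { get; set; }
        public DateTime ValidUntil { get; set; }                // the expiry date
        public IReadOnlyDictionary<string, decimal> PriceByProductNumber { get; set; }

        // Lets the service agent decide locally whether the cached copy may still
        // be used or whether it must be refreshed (or treated as stale).
        public bool IsValidAt(DateTime moment) => moment >= ValidFrom && moment < ValidUntil;
    }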

For such predictable reference data, which has the aforementioned properties, we would not need to design specific exception handling. The chance of it causing concurrency issues is not high enough to justify the effort of designing such exception handling.

We will still check for the validity of requests, but when the request is invalid we will use the normal exception handling. For example, when the cost center we used in the request is no longer valid, we will tell the user that the input was invalid or we'll throw an exception to the calling application. We will not try to figure out why the cost center was changed.

Using this last definition, we can extend the number of cases that fall into this category. Looking at a traditional application for general ledger and order management, I make the following observations:

It will only seldom happen that two people in the company change the master definition of an account concurrently. This happens even less for an account in the general ledger than for accounts payable (AP) or accounts receivable (AR). This data is used by multiple applications, however. Changing it was never a problem while all work was done on the central mainframe, because every application using it would retrieve the information in real time. Now that work is done external to the mainframe, we need to turn this master data into predictable reference data. This can be done in the design of the application by specifying a change time, but also by organizational means. The organization could send out an e-mail message stating when the master data will change, and then everyone has the chance to synchronize the information after the change.

Arbitrary Data

Of course we're used to living with less predictable data. Everybody who has ever dealt with selling or buying stock knows that the data you see may not be the data you get. If we think of a calendaring application that allows you to schedule appointments while offline, other people's availability may change unpredictably.

It is important to be aware of which data may change and which may not. Not all data held by the service agent is predictable reference data. The list of open orders is not constant, and the sales agents definitely want to make additions to that list, whether they are online or offline. The order entered by the sales agent and synchronized with the service will be changed by the backend system while it is being processed. That is, the status of the order will change.

We will differentiate between these types of data that have been obtained from the service, and that are ultimately owned by the service:

  • Application data—Data for which the local application may assume responsibility. For example, the new order that the local system is working on. We don't care about this data because it cannot cause concurrency issues.
  • Reference data—Data that is retrieved from a service; I differentiate three types. You should feel free to come up with the categorization that makes sense in your own environment. The goal is to identify the behavior and to find recipes for each category, thereby reducing the chance, and potential impact, of conflict.
    • Predictable data—Data that meets the four properties listed above: it is not changed by the service agent, it is stable, it is valid for a specified period, and it is uniquely identifiable.
    • Personal data—Data for which the user may be thought responsible. For example, my calendar or my inbox.
    • Arbitrary data—Data that may be used by the local application, but may be changed at any time by another system. For example, someone else's availability information and the stock on hand. One simply cannot predict if and how it's going to be changed.

Note that there is no guarantee that there is no concurrent change of personal data. Perhaps it is unlikely that other users will change the data, but the same user may change the data through different systems. For example, I may use a notebook and a desktop to read my e-mail messages and make appointments. In the end, the service owns the data and the local changes are given to the service as a request. The service may therefore either fulfill or reject the request. I call this personal data. Personal data is data for which the chance of conflict is minor because it is "owned" by either the user or the system, and for which the impact of conflict is relatively minor because the user should normally be aware of the conflict.

Arbitrary data is the least reliable view on data held by a service, and the hardest with which to deal. It is worth the effort to transform this type of data, either into predictable reference data or personal data, to make it more predictable. For example:

  • Does it have to be possible to change the customer's properties, such as price group or discount group, at any given time, or can we apply some restrictions? Can we formalize these restrictions in a policy so the service agent can work with them?
  • Do we have to change the order, or can we add an order item using a posting? Even though it may still affect the complete order, in the latter case this new data is personal, since no one else is making the same addition.

This does not eliminate the concurrency issues, but it does make them more manageable.

Reduce the Dependency on Arbitrary Data

Try making arbitrary data either:

  • Predictable reference data by managing the time of validity.
  • Personal data by assigning ownership (i.e. reducing the chance of concurrent update).

Update Strategies

Pessimistic Concurrency Model

Several techniques provide reliable concurrency. Databases offer locking mechanisms to prevent multiple users from overwriting each others' updates. This is known as pessimistic concurrency. Although this is probably one of the best mechanisms to use within a Web service, it is not a recommended model for Web services to expose, nor is it recommended for any situation in which a user or client may hold such a lock for a prolonged time. One user may hold a lock for an unpredictable period, preventing all others from changing that data. Generally, holding on to resources for an extended period, whether by locking resources such as database rows and files or by holding on to resources such as memory and connections, is a bad idea. It reserves resources for one user while depriving all other users of these resources. In service-oriented architecture, the service has no influence on the time needed by its consumer to process before sending an update, nor can it even influence whether its consumer will do that update at all. Therefore, services should not use pessimistic concurrency on their consumers' behalf.

Recommendation: Confine Pessimistic Resource Usage

I recommend using pessimistic concurrency behavior only in the confines of a system that is completely under control. That is, a system in which you control how long a user or processes can lock or reserve resources. This reduces and controls the impact of resource use on other users.

Optimistic Concurrency Model

Another mechanism is optimistic concurrency, also called optimistic locking. When requesting an update, the original data is sent along with the changed data, and the service checks whether the original data still matches the current data in its own store (normally the database). If it does, the service executes the update; otherwise it denies the request (an optimistic failure occurred). You can of course optimize this by using a timestamp or an update counter in the data, in which case only the timestamp needs to be checked.
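
A minimal sketch of this mechanism, using a hypothetical in-memory store and an update counter instead of a timestamp, might look as follows; the names (CustomerRow, TryUpdatePhoneNumber) are illustrative only.

    using System.Collections.Generic;

    public sealed class CustomerRow
    {
        public string CustomerId;
        public string PhoneNumber;
        public int Version;   // incremented on every successful update
    }

    public static class CustomerStore
    {
        private static readonly Dictionary<string, CustomerRow> Rows =
            new Dictionary<string, CustomerRow>();

        // The client sends along the version it originally read; the update is
        // applied only if that version still matches, otherwise an optimistic
        // failure is reported back to the caller.
        public static bool TryUpdatePhoneNumber(string customerId, int originalVersion, string newPhoneNumber)
        {
            if (!Rows.TryGetValue(customerId, out var row) || row.Version != originalVersion)
                return false;   // optimistic failure (or unknown customer)
            row.PhoneNumber = newPhoneNumber;
            row.Version++;
            return true;
        }
    }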

This form of optimistic concurrency is a practical mechanism for updates on master data. Master data are the properties of an entity. This is data that does not change very often, can be read by everyone, has minimal preconditions, and only rarely gets updated—an update to the customer's phone number, for example. In such situations, the chance of running into an optimistic failure is minimal. For other tasks, however, such as entering postings to a General Ledger, this technique is less optimal:

  • Imagine returning an optimistic failure every time a General Ledger entry causes the sales tax (or VAT) account to be updated.
  • Nor is it good for ordering stock at the current stock price. The stock price is too volatile, leading to a huge chance of optimistic failure.

As the chance of optimistic failure increases, the usability obviously decreases. If I have updated the contact information during a visit to the customer and the synchronization fails because someone else has upgraded the customer in the meantime, the problem is a result of the software design that implemented optimistic concurrency blindly. I want the software to only check what is needed to verify that I am working with correct and up-to-date information.

I typically use a broader interpretation of optimistic concurrency. If there is more data involved than just the data that is updated, that data needs to be checked, as well. If the request uses some previously retrieved information to formulate the request, this information may be outdated, and may need to be checked by the service within the operation.

More generally, the preconditions that were used to create the request are sent along with the request, and these preconditions are then verified by the service. If there are more preconditions, more information has to be sent along. These preconditions may be, but do not have to be, based on retrieved information. For instance, if I look at the stock price and I decide to sell my shares, I may formulate the request to sell these shares. If the price drops in the meantime, I will have a problem, however. So, I would rather specify a minimum price; this price may be based on the current stock value, but it doesn't have to be.
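
The difference is easy to see in code. In the sketch below (hypothetical names: SellSharesRequest, TryExecute), the request does not echo back the stock price the client happened to read; it states the precondition the client actually cares about, a minimum acceptable price, and the service verifies that precondition at execution time.

    public sealed class SellSharesRequest
    {
        public string Symbol { get; set; }
        public int Quantity { get; set; }
        public decimal MinimumPricePerShare { get; set; }   // the explicit precondition
    }

    public static class TradingService
    {
        public static bool TryExecute(SellSharesRequest request, decimal currentPrice)
        {
            if (currentPrice < request.MinimumPricePerShare)
                return false;   // the precondition no longer holds; reject the request
            // ... place the sell order at currentPrice ...
            return true;
        }
    }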

Recommendation: Limit Optimistic Updates

The point is, I recommend using a broader optimistic concurrency model in which you decide on and make explicit what you do, and what you do not, want to be checked. Limit the narrower optimistic concurrency behavior to updates of master data with very limited side effects, and to master data that does not change too often—in other words, only for changes with a small or acceptable likelihood of running into an optimistic failure.

Postings

For situations like the general ledger posting, there is an easier and more logical way of offering concurrency than directly updating a general ledger account: the request specifies the account number and the desired posting, and the service then takes this request and adds the desired amount to the debit or credit side of the account.

This avoids having optimistic failures altogether by avoiding the use of arbitrary data. The posting does not specify the balance of the account, which is arbitrary data; instead, it requests addition or subtraction. The posting still uses reference data. That is, it uses the account numbers to formulate the posting, and we may safely assume that these will be constant.
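
A sketch of such a posting request follows. The names (LedgerPosting, LedgerAccount) are hypothetical; the point is that the request carries an account number and an amount to add or subtract, never the resulting balance.

    public enum PostingSide { Debit, Credit }

    public sealed class LedgerPosting
    {
        public string AccountNumber { get; set; }   // stable reference data
        public PostingSide Side { get; set; }
        public decimal Amount { get; set; }
    }

    public sealed class LedgerAccount
    {
        public string AccountNumber { get; set; }
        public decimal Balance { get; private set; }

        // The service applies the posting to whatever the balance happens to be
        // at that moment; no check against a previously retrieved balance is needed.
        public void Apply(LedgerPosting posting) =>
            Balance += posting.Side == PostingSide.Debit ? posting.Amount : -posting.Amount;
    }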

Of course, this design pattern does not eliminate all optimistic failures. As discussed in the previous section, there may be other preconditions. For example, I could set the price as a precondition of an order, as in, "order 20 widgets for the price of $14.83." In this example, an outdated price would lead to an optimistic failure.

Neither does it eliminate exceptions in those cases where the resources are limited. If I want to book a specific room in a hotel because of the great view, I can formulate that as a posting, as in, "add Room x to my reservation." The service may still reject my request, however, because the room is no longer available at the requested time, even though it was available just moments ago. This problem may be the result of the business requirements. If, instead, the hotel treats all of its rooms as interchangeable resources by categorizing the rooms (smoking, non-smoking; view, non-view), it offers more concurrency while there is more than one room available (as well as being commutative). This is a business solution to the business problem of concurrency. Postings do not solve the business problem, but they do not add to the problem, either.

Recommendation: Design Postings

I recommend using the pattern of postings as the preferred type of business action for communicating with services, to minimize the occurrence of optimistic failures.

Business Actions

Let us imagine a nice old-fashioned bank where the tellers send documents to the back office using pneumatic tubes. The teller can formulate a request, make a copy of the request, put the copy in a carrier, and send it through the tube to the back office, where the request is handled. Then the teller stores the original in his local filing cabinet. This is very much like sending messages to a service; the service agent formulates a request and sends a copy to the service. If the service formulates a response, it sends a copy to the agent.

Suppose the teller requests customer information from the back office. The back office sends a copy of the customer information, and keeps the original. When the customer information needs correction, the teller can do one of the following:

  1. Replace the information on the copy (using an eraser) and send it as request to change the customer info, or, slightly more realistically, take a clean customer form, fill it out completely, and send it.
  2. Annotate the copy by adding the correct data in red, send it, and ask for the annotated changes to be applied.
  3. Take a customer change request form, provide the new data, and send a copy of the document.

When we look at the usefulness of each, we find the following:

Replacement requires the other side (the back office) to either simply replace the information without much checking, or try to figure out what changes were requested, by comparing the request to the information stored in the backend system and validating the changes.

The annotation seems much friendlier to the back office. The changes are readily recognizable and the assumption (the original information) is clear.

But how friendly is it really? If the address has changed, for instance, was that because there was an error or because the customer moved? In either case, do we have to only update the address in the bank account, or do we also have to notify others, such as the insurance systems? How do we find out that further action is required? Also, if we need to take further action and send the request on to others, can we just send them a copy, or do we have to formulate another request? This method, consequently, requires interpreting the requested value and deducing subsequent actions from it.

Another drawback is that the recipient must check whether the combination of changes makes sense and is allowed. Can the teller update the address, upgrade the customer, and change the married status of the customer all at once?

It also doesn't seem to make much sense to use this type of change request for more volatile data, because the chance of a concurrency issue would be too large. For all of these other requests dealing with volatile information, such as depositing money or cashing a check, we use the third method, which introduces standard request forms for each type of request. Each form applies to a very limited set of request types, so each form is used to perform a simple activity; updating the customer's account, for example, or updating the cash register and then putting some money in the pneumatic carrier; or, it is used to start a more complex collaboration, such as an address change, that needs to flow through the organization.

Why does a bank have these forms? Well, among other things:

  • It is easier to figure out what the request is.
  • It is easier to provide process for standardized requests.
  • It limits the amount of data on the request to what is required.
  • It provides hints as to what information is needed.
  • It is easier to track, document, and audit the work.
  • It is easier to route work to specialist back office employees for streamlined processing.

These specialized forms are used to formulate what I call a request for a business action.

I believe that requests should always be sent with a purpose in mind:

  • I want to withdraw $10.
  • Now I need to retrieve some data (account information such as type, number, or balance) in order to formulate my request (create a withdrawal).
  • Then I submit my request. (To get $10 from savings. Neither to change my local state from $100 to $90 nor to change the service's state.)

I communicate with the service with a purpose; that purpose is not to change my local data, but to get some business functionality done. The service on the other side does not interpret my request as a simple data change; it treats it as a request for service, and eventually people will do something and something will physically happen, which is then recorded in the business systems. I want to withdraw $10, which has an influence on both my account and on the dollar count of the register. You don't want me, the client, to formulate the calculation, and then have the service check that all numbers I gave were correct and that I subtracted the same amount from both accounts. You want me to state that I want $10.
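
Expressed as a message, such a business action carries the intent rather than the recalculated balances. The sketch below uses hypothetical names (WithdrawalRequest); the service works out the effects on the account and on the register.

    using System;

    public sealed class WithdrawalRequest
    {
        public Guid RequestId { get; } = Guid.NewGuid();   // supports repeatability
        public string AccountNumber { get; set; }
        public decimal Amount { get; set; }                // "I want to withdraw $10"
    }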

If I formulate my request carefully, I can make the preconditions more explicit in the design and I can avoid adding unnecessary preconditions. If I think of my request as what it is, a request and not just a data update, I make it easier to anticipate all the possible side effects, and I can think more clearly about the assumptions made while formulating the request.

If some formatted data is retrieved and then changed locally, and the new data, together with the original data, is sent back, then I believe that this is just an update to a record; in principle, it is just a database request to a service. I would like to see the nature of the request, the values that are needed to fulfill the request, and the preconditions upon which the requestor relies.

An update is a request to change data. A business action is a request to perform an action on my behalf (from the user's standpoint) that is more complex than just a data change. Either case may require some form of optimistic checking.

A posting is a special case of a business action. It is clear that postings can easily be applied to numeric data, for which addition and subtraction apply. They do not rely on arbitrary data, so they reduce concurrency issues, and they are re-orderable. If I say "deposit $400 in my account," it does not matter what the original balance is. Adding information rather than replacing information makes managing concurrency much easier. For instance, sending e-mail messages just adds to the Outbox and the Sent Items folders. I can send e-mail messages from two machines without having to worry about concurrency. A doctor will add information to a patient's record and, even if two doctors add information at the same time, they do not conflict with one another—the presentation to the reader should provide a logical sequence of information.

This does not mean that formulating all business actions as postings is a panacea; you must be careful not to create design restrictions that are not rooted in the business problem. An address change is an operation that may cause a conflict, because you could have the current address and possible future addresses, which may at some point be contemporaneously valid. Not allowing this is a design restriction, not rooted in the actual business problem. Allowing multiple versions of pricelists and the like helps solve a problem that is caused by design rather than by real business needs. To solve such problems, order entry systems may support validity periods for price lists. Similarly, human resource systems commonly offer multiple slots for employee addresses, each with a validity period, in order to manage address changes. In each of these cases, the information is made more reliable and thus easier to use.

Recommendation: Design Business Actions

I recommend using business actions for formulating requests. By formulating the request and the preconditions or assumptions, one can reduce the chance of conflict and make that chance explicit.

Journals

In the case of general ledger postings, there is another elegant and often used solution; the postings are entered in a journal and this journal is sent to the service. The journal can then be checked for consistency and completeness before all the postings are executed.

The service agent can create the journal and send it to the service in its entirety or, alternatively, the service can keep the journal and let the service agent update the server-side journal. For example, the shopping cart at Amazon.com is a server-side journal. A smart client has a client-side journal. Journals have been used for general ledger posting for several centuries already; they form valuable documents for tracking and justifying the general ledger balances.

Note that orders, invoices, and many other documents used in current practice are forms of journals. They are business documents, formulating a request to a service, or, going back in history, to a system or to a department. After being accepted by the service, these business documents form an important reference for further business actions; in other words, journals are retained until long after execution of the request.

Sometimes a business action looks like an update. For example, in a request to change an existing order, that is "change the order line from '11 widgets' to '20 widgets'," the update operation has to check that the entity data is the same as is assumed in the request. The system has to verify that it has the additional widgets, plan to replace the ordered widgets in stock, and account for other effects of the business transaction.

This last example is very much like the update of master data, because the data of the business document (the order document) changes. Changing this order data has an obvious side effect, however; it is much like entering a new order line.

The same business action can, and perhaps should, be modeled as a new, compensating, posting in the journal specifying the difference in the order: "cancel 9 widgets". Not only would this avoid the immediate update, it would add the business capability of tracking the order history. Similarly, in a general ledger, journal entries are never changed; they may instead be compensated by adding other postings.

A journal is a business document:

  • That combines a set of related postings or business actions in a single document.
  • That can be uniquely identified.
  • That can be sent as a request to a service.
  • That has an expiry date.
  • In which entries are not changed (after first submission); instead, compensating entries are added.

The journal is part of the "data contract" of the service; the service publishes the schema of the journal. The journal can manage the dependencies between the posting requests it contains, it can document the audit trail, and its unique identification can prevent double postings.
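
Building on the LedgerPosting sketch shown earlier, a journal might be modeled as follows; the names (Journal, Compensate) are hypothetical. Entries are only ever added, and a correction is expressed as a new compensating entry rather than a change to an existing one.

    using System;
    using System.Collections.Generic;

    public sealed class Journal
    {
        public string JournalNumber { get; set; }   // unique, ideally sequential
        public DateTime ExpiresOn { get; set; }     // the journal's expiry date

        private readonly List<LedgerPosting> _entries = new List<LedgerPosting>();
        public IReadOnlyList<LedgerPosting> Entries => _entries;

        public void Add(LedgerPosting posting) => _entries.Add(posting);

        // Entries are never edited after submission; a correction is a new,
        // compensating entry on the opposite side of the account.
        public void Compensate(LedgerPosting original) =>
            _entries.Add(new LedgerPosting
            {
                AccountNumber = original.AccountNumber,
                Side = original.Side == PostingSide.Debit ? PostingSide.Credit : PostingSide.Debit,
                Amount = original.Amount
            });
    }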

Recommendation: Use the Journal Pattern

I recommend using what I call the journal pattern on top of postings and business actions. It is easy to explain to business managers, and it makes the design more reliable. It helps in making the requests uniquely identifiable, thus supporting an audit trail, and it helps in making the dependencies between business actions manageable.

I recommend not changing the postings or business actions in the journal, but rather adding new "compensating" postings or business actions.

Reliability

While business actions reduce the nasty side effects of concurrency, they leave another problem open: what if the request is executed twice? An update of master data does not have this problem—the customer telephone number can be updated multiple times but the end result will be the same—and when using optimistic checking, the transaction will be refused the second time. The general ledger posting and the order submission, on the other hand, had better get executed exactly once. Lower-level infrastructure such as message queuing, which offers reliable and transactional messaging, may help, as long as the same request is not entered into the queue multiple times. This is designing for repeatability (or idempotency).

Accountants, over the centuries, have had some good ideas; one of these was to require that each and every business document would have a unique number for that type of document. Thus, a general ledger journal has its unique number, as do an order and an invoice. The service can use this number to ensure that it executes the request only once. The service agent still has to ensure that it executes the transaction at least once, as it should do for the other mechanisms, as well. Together, this ensures that the request, the journal or the order, is executed exactly once.
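
In code, the service-side check can be as simple as remembering which document numbers have already been processed, as in the sketch below (hypothetical JournalProcessor, reusing the Journal type sketched earlier). In a real service the set of processed numbers would be persisted in the same transaction as the postings.

    using System.Collections.Generic;

    public sealed class JournalProcessor
    {
        private readonly HashSet<string> _processedJournalNumbers = new HashSet<string>();

        // Returns false when the journal was already executed, so a duplicate
        // delivery of the same request is safely ignored (exactly-once effect).
        public bool Process(Journal journal)
        {
            if (!_processedJournalNumbers.Add(journal.JournalNumber))
                return false;

            foreach (var posting in journal.Entries)
            {
                // ... apply each posting to the general ledger ...
            }
            return true;
        }
    }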

The business will also use the document number as an audit trail. The business stores the journal and the entries in the General Ledger. Rollup will specify the source document, in this case the journal, to track the source of the change.

Defining a unique number for a business document is an art in itself. Using a globally unique identifier (GUID) is a simple solution, but accountants like sequential numbers that allow them to check for completeness. The often-used approach of defining number ranges has the disadvantage of requiring accountants to check many sequences. Pleasing the accountant is a good business goal because of the cost reduction, so it may be worthwhile to spend time on a unique sequential numbering scheme.

Another, often overlooked, aspect of reliability is the effect of the time between creating the request and the actual execution of that request. Is the request still relevant after a week, a day or even a few seconds? It is good to give the request, and thus the journal, an expiry time, or at least to think about the problem.

Predictability

Sometimes you may want to make local changes to show the user the expected result, and then allow new changes based on that result. Consider a calendaring system that keeps track of individuals' free and busy times, to be used in scheduling new appointments. After making such an appointment, that free or busy information is updated locally. This does not mean that this free or busy information is sent to the service as a request; it is your desire to change it that is the request. After sending out the requests for appointments, the free or busy data will be resynchronized from the central calendaring service.

The local system predicts the service behavior by implementing a subset of the service's business logic. Depending on many external factors, such as the person who is accepting or rejecting your request, the prediction may have a higher or lower chance of being correct. The local system may then use the results for subsequent actions. These subsequent actions now depend on the previous actions, including the responses of all invited parties. If the prediction was incorrect, either because the first action failed or because the system returned an unpredicted result, the local application must rethink the subsequent actions. Managing these dependencies is tricky but important. Making these dependencies recognizable is even trickier and more important. Think about showing all tentative values in a different color, or making it clear that the subsequent action is tentative.
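
One way to keep predictions recognizable is to tag the locally changed values as tentative and remember which pending request produced them, as in this sketch (hypothetical names: CalendarSlot, IsTentative).

    using System;

    public enum SlotStatus { Free, Busy }

    public sealed class CalendarSlot
    {
        public DateTime Start { get; set; }
        public DateTime End { get; set; }
        public SlotStatus Status { get; set; }

        // True until the calendaring service confirms the prediction; the UI can
        // render tentative slots differently (for example, in another color).
        public bool IsTentative { get; set; }
        public Guid? PredictedByRequestId { get; set; }   // the pending request that caused the prediction
    }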

Conflict Resolution

Conflict resolution and the decision whether to pass exceptions to the user are important issues in user-friendly systems.

As long as there are no conflicts, everything is simple. When we do have an exception, we can try to correct the situation automatically. In situations such as an e-mail system, this is a very valid and effective approach; if an e-mail message is deleted locally and it has already been deleted centrally, the exception may be discarded. If an e-mail message has been read on one client and has been deleted from another, the conflict resolution is simple, too; the e-mail message may be deleted. Even in an e-mail system, however, automatic conflict resolution is not always possible. If a draft e-mail message has been deleted on one system and has been edited on another, or if an appointment has been changed on two clients, the user ultimately has to decide what needs to be done. As you can see, sometimes conflict resolution can be extremely complicated or just undesirable, and in other cases it may be easy and the logical thing to do.

Even in simple cases, it still needs thought. If the request to update a telephone number asks for an update to the value already in the service, should the service respond with an exception, or should it just accept and discard the request? Should the user see an error message if the request to delete a document in a document management system cannot find the document? If the request was to "send order" and the order had already been sent once, however, then the system would need to discover whether this was an identical order to process or a duplicate order that should be ignored.

Without wanting to underrate the complexities of an e-mail system, it is safe to say that the problem of working offline can be more complicated than that. The main difference will often be that the rules for conflict resolution are much more complex. The basic idea, however, remains the same.

The service agent stores a local view of the service's data. It does not just update the local data and then try to synchronize; instead, it formulates the requests to the service in the form of locally stored journals, and then optimistically updates the local view if desired. For example, the e-mail system would update the Inbox while reading e-mail messages, and it would update the Outbox while creating e-mail messages. The service agent would not just synchronize the mailboxes, however; it would keep a journal of changes to send the requests for updates to the e-mail service.

On reconnection, the journals are processed and the local view is updated, either by validating the view against information from the service or by retrieving a completely new view and simply overwriting the local one.

The difficult part here is dealing with exceptions, and dependencies between exceptions. If one of the journal entries cannot be processed, the user needs to be notified. If, however, there are dependent journal entries associated with the error, they must either be skipped or possibly adapted for further processing.

Let us look at an example:

Suppose my job is to secure capacity with our suppliers for incoming orders. I just made a number of deals with suppliers, and I am synchronizing my laptop with the central system. For a specific order, I made deals with three suppliers, but the system finds that the order has been cancelled.

What would the desired reconciliation be? Should the system pop up three error messages or just one? Or should the system provide me with a list of problems in an e-mail message, a task list, or on some specific screen of my offline application? Should the system organize the exceptions around this single source, the cancelled order, so that I can understand what is going on and take consolidated action?

Typically, it is very hard and often impossible to correct the exceptions or to deal with the exceptions automatically, but it is helpful to organize the exceptions and not to flood the user with a set of unorganized error messages.

You have to understand the dependencies between the offline changes and manage these dependencies. This will help provide a more user-friendly behavior.
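
As an illustration of this, the sketch below (hypothetical names: JournalEntry, Reconciliation.Replay) skips entries whose prerequisite entry has already failed and groups all resulting exceptions under that single root cause, so the user sees one consolidated problem instead of a flood of error messages.

    using System;
    using System.Collections.Generic;

    public sealed class JournalEntry
    {
        public Guid Id { get; set; }
        public Guid? DependsOn { get; set; }   // the entry this entry relies on, if any
        public string Description { get; set; }
    }

    public static class Reconciliation
    {
        // Processes entries in order; tryProcess returns false when the service
        // rejects an entry. Failures are grouped by their root cause.
        public static Dictionary<Guid, List<JournalEntry>> Replay(
            IEnumerable<JournalEntry> entries, Func<JournalEntry, bool> tryProcess)
        {
            var failedGroups = new Dictionary<Guid, List<JournalEntry>>();
            foreach (var entry in entries)
            {
                if (entry.DependsOn is Guid root && failedGroups.ContainsKey(root))
                {
                    failedGroups[root].Add(entry);   // skip and report with its root cause
                    continue;
                }
                if (!tryProcess(entry))
                    failedGroups[entry.Id] = new List<JournalEntry> { entry };
            }
            return failedGroups;
        }
    }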

Conclusion

Recommendations for Designing Service Interaction

A few theoretical notes:

  • Reducing the preconditions—that is, reducing the dependencies of the request—reduces the chance of optimistic failure.
  • Reducing the chance of the preconditions being invalid reduces the chance of optimistic failure.

Thus:

  • Use postings and business actions to achieve the first.
  • Try to change arbitrary data into either predictable reference data or into personal data to achieve the second.

For predictable reference data we do not have to design for issues caused by concurrent change; we can just assume it will never happen, and that it is not worth the effort to design elaborate exception handling.

For arbitrary data we have to design for exceptions caused by concurrent change. When we use arbitrary data, we accept the chance that we will have to tell the user that we encountered a concurrency issue. We accept that we will have to tell the user that the transaction was rejected, or that we have to ask the user for a resolution. We may have to design how to turn arbitrary data into predictable reference data, and also design for multiple validity periods of specific information, such as an address change.

Arbitrary data may be problematic, and we should try to avoid using it as a precondition. That is, we should try to avoid using arbitrary data to formulate requests to the service, and avoid using it to support the users' decisions.

For personal data we can accept exceptions, because they're easy to explain to the user, and should be expected. (The user caused them in the first place, for instance, by making offline changes on two systems.)

Almost all recommendations in this article concern the design of the service interaction, and can thus be formulated as recommendations for good service design.

  • Confine pessimistic resource usage

    I recommend using pessimistic concurrency behavior only in the confines of a system that is completely under control. That is, a system in which you control how long a user or process can lock or reserve resources. This reduces and controls the impact of resource use on other users.

  • Limit optimistic updates

    I recommend limiting optimistic concurrency behavior to updates of master data with very limited side effects, and only for master data that does not change too often—in other words, only for changes with a small or acceptable likelihood of running into an optimistic failure.

  • Design business actions

    I recommend using business actions for formulating requests. By formulating the request and the prerequisites or assumptions, one can reduce the chance of conflict and make that chance explicit.

  • Design postings

    I recommend using the pattern of postings as the preferred type of business action for communicating with services, to minimize the occurrence of optimistic failures.

  • Use the journal pattern

    I recommend using what I call the journal pattern on top of postings and business actions. It is easy to explain to business managers, and it makes the design more reliable. It helps in making the requests uniquely identifiable, thus supporting an audit trail, and it helps in making the dependencies between business actions manageable.

    I recommend not changing the postings or business actions in the journal, but rather adding new "compensating" postings or business actions.

    The journal is part of the "data contract" of the service; the service publishes the schema of the journal. The journal can manage the dependencies between the posting requests it contains, it can document the audit trail, and its unique identification can prevent double postings.

  • Reduce the dependency on arbitrary data

    Try making arbitrary data either:

    • Predictable reference data by managing the time of validity.
    • Personal data by assigning ownership (i.e. reducing the chance of concurrent updates).

 

About the author

Maarten Mullender is a solutions architect for the .NET Enterprise Architecture Team; this team is responsible for architectural guidance to enterprises. During the last seven years, Maarten has been working with a select group of enterprise partners and customers on projects that were exploring the boundaries of Microsoft's product offerings. The main objectives of these projects were to solve a real customer need and, in doing so, to generate feedback to the product groups. Maarten has been working for more than twenty years as a program manager, as a product manager and as an architect, first for Nixdorf, which later merged into Siemens-Nixdorf Information Systems, and then for Microsoft on a wide range of products.
