Reliability in Connected Systems
by Roger Wolter
Summary: Connected system applications are composed of a number of loosely coupled services often spread over a network; therefore, achieving high levels of reliability and availability for connected systems applications poses a unique set of architectural challenges. For example, if an application stops running because any one of 10 services running on 10 different servers is unavailable, the failure rate for the application is about 10 times the failure rate of the individual services. Data failure makes this issue much more critical, because a data source may be used by dozens of services. This article discusses the reliability issues that must be considered in architecting a connected services application, and shows how some of the new features in Microsoft SQL Server 2005 and Microsoft messaging products address these issues.
From hardware, software, and maintenance perspectives, reliability is expensive, so it is very important to understand the reliability requirements of your application. While an application with no reliability requirement is rare, implementing an application with more reliability than is required can be a waste of time and resources. For this reason, it is important to understand the reliability issues of connected systems, so you can architect a solution that will provide the required level of reliability, while not wasting resources by providing reliability that is not required.
In service-oriented architecture (SOA) or connected systems, services communicate with each other through well-defined message formats, which means that the reliability of the connected systems application will be influenced strongly by the reliability of the messaging infrastructure it relies on to communicate between services. We will use the example of services in an automated-teller banking application to illustrate the various degrees of messaging reliability and how they are achieved. Message handling between services is generally more complex than client/server messaging, because client/server messaging can rely on the user to make some decisions about how to handle various error situations and time-outs, while the server initiating the message exchange in server-to-server messaging must make all the decisions that the end user would normally make in a client/server interaction.
Many service-oriented applications achieve better throughput by using asynchronous messaging. With asynchronous messaging, a service sends a message to another service without waiting for a response to the message before continuing. The service processes many more requests, because it doesn't waste time waiting for responses to requests that it makes to other services, but it puts the burden of ensuring that the message is delivered and processed on the messaging infrastructure. The choice of synchronous or asynchronous message handling is usually determined by the business requirements of the application.
For example, if a customer is trying to withdraw money from an automated-teller machine (ATM), making an asynchronous request to check the customer's balance and proceeding to dispense the money without waiting for a response from the balance check is generally not a sound business decision. This scenario doesn't necessarily rule out an asynchronous request, however. The ATM might dispatch a balance request as soon as the customer selects the withdrawal option, and then let the customer enter an amount while the balance check is proceeding asynchronously. Once the withdrawal amount is entered and the balance is received, the ATM application can make the decision on whether to dispense the cash.
Let's look at how the ATM handles the balance-request message. The ATM service will send the balance-request message to the account service to get the customer's account balance. At this point, one of three things will happen: The balance will be returned successfully, an error will be returned, or the request will time out because the result was not returned in a timely manner. If the balance is returned, the ATM service continues with the transaction. If an error is returned, the ATM uses its business logic to work with it—maybe using a cached copy of the balance, or dispensing or not dispensing the cash depending on the amount requested. The hardest one to work with is the time-out. A time-out may mean the message got lost in transit, the message arrived at the account service but the response was delayed for some reason, or the response was sent and then lost on the way back.
In most cases, the best response is to try again and hope that it will work better next time. If the message was lost on the way the first time, it may get through this time. If the response was slow, the original response might arrive before the second request times out. Unless the account service is unreachable or completely gone, a retry should succeed eventually. Depending on what the problem was, the balance request might have been processed multiple times, or multiple results may be received; but as long as the ATM service is prepared to work with them, these aren't serious issues.
If the message was sent asynchronously, the ATM service might not have the information around to resend the message when a retry is required, so the messaging infrastructure will need to keep a copy of the messages it sends and resend them if required. For this type of simple request message, keeping a copy of the message in memory is probably adequate, because if the ATM loses power, the customer will have to start over anyway. If the ATM service needs a response even after a power loss, the message will have to be put into some persistent storage, and the messaging system will have to resend it when it restarts. Figure 1 shows three possible outcomes of a balance request.
For the synchronous version of this request, a simple SOAP Web service will provide the required degree of reliability. Error handling and time-out handling are the responsibility of the ATM service, so it will determine the level of reliability for the request. For the asynchronous version of the balance request, either the WS-RM channel in the Windows Communication Foundation (WCF) or express delivery in MSMQ will provide the required reliability, if the request doesn't have to survive system failures. If the message has to survive a system restart, MSMQ recoverable delivery is the appropriate choice. These messaging systems will handle the time-outs and retries that are required to deliver the message and the response, so the ATM service doesn't have to include this logic.
Now that we understand the reliability issues with a balance request, let us move on to the balance-change request that happens after the cash is dispensed. The reliability requirements for this message are a lot higher, because if it doesn't get delivered successfully, the bank will have given the customer cash without reducing the customer's account balance. While the customer is not likely to complain, the bank will not be happy if this situation happens regularly.
The balance-change message must get delivered and processed in the face of any number of network and system failures. The memory-based retry mechanism obviously is not adequate for this task, because a system failure will lose the copy of the message being kept for retries. Keeping a persisted copy of the message for retries may also be an issue, because it is possible for multiple copies of the message to be delivered because of retries. If the message decreases the customer account balance by $200, for example, delivering the message multiple times can lead to customer dissatisfaction.
The retries required to ensure reliable message delivery work only if the messages are idempotent. An idempotent message can be delivered any number of times without violating the business constraints of the application. Some messages are naturally idempotent. For example, a message that changes a customer's address can be executed multiple times without ill effects. Other messages are not naturally idempotent, so the messaging system must make them idempotent to prevent the damage that can be done when a message is processed more than once (see Figures 2 and 3). This transaction is normally done by keeping a list of messages that have already been processed; if the same message is received more than once, it will not be processed more than once.
Figure 1. Balance request (Click on the picture for a larger image)
Most messaging systems don't keep a copy of the incoming messages, but the sender assigns an identifier to each message and the receiver keeps track of the identifiers it has seen before. These identifiers must be kept until the message destination is sure that the source has thrown the message away, because if the source fails, the message might be sent again days later when the source restarts. The list of processed messages must also be persistent, because the destination service may fail between retries. The destination must also keep track of the message that was sent in response to the incoming message, because the source will keep sending the original message until it receives the response, and the reason for the retry might be that the original response message was lost.
Unless you are willing to persist and track the messages in your service logic, you will need a reliable messaging infrastructure to provide this level of reliability. Microsoft has three options for this infrastructure: MSMQ recoverable messaging, SQL Server Service Broker, and BizTalk. Which option you choose depends on your reliability requirements and application design.
Service Broker and BizTalk provide more reliable message storage than MSMQ, because they store messages in a SQL Server database while MSMQ stores them in a file. If your application can work with the occasional loss of messages when a file is lost or corrupted, MSMQ will be adequate for your needs. Some MSMQ applications guard against the potential loss of messages by storing a copy of the message in a database. If you are going to store messages, using Service Broker (which stores messages in the database, anyway) is significantly more efficient.
Service Broker and BizTalk provide roughly the same degree of reliability and defense against message, loss because they both store messages in the database. Service Broker's message handling is built into the database server logic, so Service Broker talks directly from the SQL Server process to the TCP/IP sockets, which is much more efficient than the BizTalk approach of an external process calling into the database to store messages in a table. While the Service Broker can deliver significantly more messages per second than BizTalk, BizTalk offers a large number of features—message transformation, data-dependent message routing, multiple message transports, orchestration, and so on—that Service Broker doesn't offer.
Figure 2. Withdraw-cash request, harmless retry (Click on the picture for a larger image)
In general, if reliable transfer of messages between databases is all your application requires, Service Broker is a better choice, because it is more lightweight and efficient at transferring messages than BizTalk. If, on the other hand, your application requires the message manipulation, data integrations, or orchestration features that BizTalk provides, Service Broker is probably not the right solution.
While MSMQ recoverable messaging doesn't provide the reliability levels that the SQL Server–based options provide, it does have the advantage of not requiring SQL Server at both messaging endpoints. If supporting a database at both endpoints is not an option and the application's reliability requirements can be met by MSMQ, MSMQ is probably the right choice as the messaging infrastructure. If both of the communicating services require data storage, however, the extra reliability of storing messages in the database is worthwhile.
Figure 3. Withdraw-cash-request retry, where the customer loses money (Click on the picture for a larger image)
In our ATM example, the ATM service will require local data storage for auditing, local storage of data for offline operation, and storage of reference data, so there will probably be a database in the ATM, anyway, and one of the SQL Server–based options is appropriate. The choice between Service Broker and BizTalk depends on the nonmessaging requirements of the application, the resources available at the ATM, and the message-volume requirements.
Previously we discussed reliability in message delivery from one service to another. Not surprisingly, we found that the amount of reliability required depends on what the application is doing and how critical the message data is to the application. Here, we will assume that messages are transferred with the required degree of reliability to a service and examine the reliability requirements for the service processing the messages.
One of the unique requirements of a service that processes asynchronous messages is that receiving a message from the queue is a "pull" operation. In other words, when a message arrives in a queue, it will sit there until an application executes a "receive" operation to retrieve and process the message. This requirement means that an asynchronous service must ensure that it executes when there are messages in the queue to be processed. The most common way to achieve this requirement is to make the service a Windows service managed by the Windows Service Control Manager (SCM). The SCM will ensure the service is started when Windows starts, and can be configured to restart the service if it fails for some reason.
While this configuration generally provides the required level of reliability and is generally the preferred solution when messages arrive at a constant rate, it can cause problems if the message load varies significantly. If the service is configured with enough resources to handle peak loads, it will be wasting resources when the message load is low; and if it is configured to handle the average load, it will get far behind during peak loads. BizTalk messaging runs as a Windows service, so a BizTalk application can rely on the Windows service to be there to handle incoming messages.
MSMQ addresses the message-load problem with triggers that start a messaging-processing service each time a message arrives in the queue. While this processing works well when messages arrive infrequently, when message load is high, the overhead of starting 1,000 copies of the services can be more than the service logic itself.
Service Broker provides a feature called activation to solve this problem. When a message arrives in an empty queue, Service Broker will start a stored procedure to handle the message. This stored procedure will wait in a loop for more messages to arrive and continue in this loop until the queue is empty. If Service Broker determines that the stored procedure is not keeping up with the messages coming in, it will start additional copies of the stored procedure until there are enough copies running to keep up. When the message arrival rate decreases, the queue will be empty and the extra copies will exit. Then, there will always be approximately the right number of resources available to service the incoming messages. Because Service Broker starts these procedures, it will be notified if one fails and replace the failing copy. If the service is an external application instead of a stored procedure, Service Broker provides events that an external application can subscribe to that will notify the service when more resources are required to process messages in a queue.
The other reliability issue that service execution must work with is a service failure while processing a message. If a service deletes a message from the queue as soon as it has received it and then fails before completely processing the message, that message is effectively lost. Similarly, if the service waits until it has completely processed the message before removing it from the queue, a failure between the processing step and the message removal will result in the message still being in the queue when the service restarts, and it will be processed again.
As mentioned earlier, processing the same message multiple times is not a problem if the message is a balance query, but processing a withdrawal twice can be irritating to the customer involved. The only real way to ensure that each message is processed "exactly once" is for the message processing and the queue deletion to be part of the same transaction. If there is a failure in processing, both the processing changes and the message deletion are rolled back, so everything is back to the way it was before the message was received.
A single commit operation commits both the message-receive and the message-processing actions. Similarly, if processing the message generates an outgoing message, the "send" of the outgoing message should be part of the transaction to avoid the situation where the service rolls back, but the response message is still sent. This type of message processing is called transactional messaging. Most reliable messaging infrastructures support transactional messaging.
Because the messages are stored in a different file from the database, MSMQ transactional messaging requires a two-phase commit to ensure both parts of the transaction are committed. Because Service Broker SEND and RECEIVE commands are TSQL commands, the messaging and data-update operations in a Service Broker service can be executed from the same SQL Server connection and be part of the same SQL Server transaction. Therefore, a two-phase commit is not required, which makes Service Broker's implementation of transactional messaging significantly more efficient than MSMQ's implementation.
Again, transactional messaging is required to make a non-idempotent service behave like an idempotent service, to eliminate the issues caused by message retries. If the service is inherently idempotent, transactional messaging is not required. If transactional messaging is not used, the service requester must be prepared to receive a response multiple times, sometimes over a long time span.
Now, we will look at the impact of data on service reliability. Most services access data while processing service messages, so data reliability is tightly bound to service reliability.
One of the unique aspects of asynchronous service execution is that, in many cases, the service messages are valuable business objects. In our ATM service, for example, if balance-change messages are lost because of a failure, the account balance is not changed and the bank loses money. For this reason, storing messages in the database—so they enjoy the same reliability, redundancy, and availability protections that the rest of the data stored there enjoy—makes a lot of sense. If the balance-change messages in our example are stored in the same database as the accounts, they will be lost only if the accounts also are lost. The backups, log backups, and Storage Area Network (SAN) features used to ensure that the bank-account information is not lost apply also to the balance-change messages, making the reliability of the service extremely high. If your message-reliability requirements are high, Service Broker or BizTalk have significant reliability advantages, because they store messages in the database.
One of the new features of SQL Server 2005 that can improve service reliability is database mirroring (DBM). DBM provides reliability by maintaining a secondary copy of a database that is maintained transactionally consistent with the primary database, by applying each transaction that commits on the primary to the secondary copy before returning control to the service. If the primary fails, the secondary copy of the database can become the primary in a few seconds.
Service Broker takes advantage of database mirroring to improve messaging reliability. If the account database is a DBM pair of databases, the Service Broker database in the ATM will open network connections to both the primary and secondary databases, and send messages to the primary. If the secondary database becomes the primary, Service Broker is notified immediately and messages are routed to the new primary with no user intervention or interruption.
Highly data-intensive services can take advantage of another feature in SQL Server 2005 to improve reliability. The integration of the Common Language Runtime (CLR) into the SQL Server engine means that the service logic can run into the database also. Therefore, for a Service Broker service, the logic, messages, execution environment, security context, and data for a service can be in the same database.
This single-location storage has many advantages in a system with high reliability requirements. Database servers generally have the hardware and software features to maintain the high reliability required by a database. This reliability can now apply to all aspects of the service implementation. In the unlikely event of a service failure, the whole environment of the service can be restored to a transactionally consistent state with the database recovery features. Not only is the data saved, but any operations in progress are rolled back and restarted in the same execution and security environment that they were in at the time of failure.
The loosely coupled, asynchronous nature of service-oriented applications imposes some unique reliability requirements. When architecting services, the level of reliability required for the service must be understood and taken into account. Microsoft offers a variety of infrastructures for implementing services that offer different levels of reliability. Choosing the right infrastructure involves matching the level of reliability required against the capabilities of the infrastructure. The new features of SQL Server 2005 provide a service-hosting infrastructure that offers unprecedented levels of reliability for services that require very high reliability levels.
Roger Wolter is a solutions architect on the Microsoft Architecture Strategy Team. Roger has 30 years of experience in various aspects of the computer industry, including jobs at Unisys, Infospan, Fourth Shift, and the last seven years as a program manager at Microsoft. His projects at Microsoft include SQLXML, the SOAP Toolkit, the SQL Server Service Broker, and SQL Server Express. His interest in the Service Broker was sparked by a messaging-based manufacturing system he worked on in a previous life. He also wrote The Rational Guide to SQL Server 2005 Service Broker Beta Preview (Rational Press, 2005).
"An Overview of SQL Server 2005 for the Database Developer," Matt Nunn (Microsoft Corporation, 2005)
"Building Reliable, Asynchronous Database Applications Using Service Broker," Roger Wolter (Microsoft Corporation, 2005)
Distributed .NET: "Learn the ABCs of Programming Windows Communication Foundation," Aaron Skonnard (Microsoft Corporation, 2006)
This article was published in the Architecture Journal, a print and online publication produced by Microsoft. For more articles from this publication, please visit the Architecture Journal website.