Asynchronous Messaging Patterns and High Availability
Updated: January 22, 2014
Asynchronous messaging can be implemented in a variety of ways. With queues, topics, and subscriptions (collectively called messaging entities), Windows Azure Service Bus supports asynchrony via a store and forward mechanism. In normal (synchronous) operation, you send messages to queues and topics, and receive messages from queues and subscriptions. Applications you write depend on these entities always being available. When an entity's health changes, due to a variety of circumstances, you need a way to provide a reduced-capability entity that can satisfy most needs.
Applications typically use asynchronous messaging patterns to enable a number of communication scenarios. You can build applications in which clients can send messages to services, even when the service is not running. For applications that experience bursts of communications, a queue can help level the load by providing a place to buffer communications. Finally, you can get a simple but effective load balancer to distribute messages across multiple machines.
To maintain availability of these entities, consider the different ways in which they can appear unavailable to a durable messaging system. Generally speaking, an entity becomes unavailable to the applications you write in the following ways:
Unable to send messages.
Unable to receive messages.
Unable to administer entities (create, retrieve, update, or delete entities).
Unable to contact the service.
For each of these failures, different failure modes exist that enable an application to continue to perform work at some reduced level of capability. For example, a system that can send messages but not receive them can still accept orders from customers, even though it cannot yet process those orders. This topic discusses potential issues that can occur and how those issues are mitigated. Service Bus has introduced a number of mitigations which you must opt into, and this topic also discusses the rules governing the use of those opt-in mitigations.
Reliability in Service Bus
There are various ways to handle message and entity issues, and there are guidelines governing the appropriate use of those mitigations. To understand the guidelines, you must first understand what can fail in Service Bus. Due to the design of Windows Azure systems, all of these issues tend to be short-lived. At a high level, the different causes of unavailability appear as follows:
Throttling from an external system on which Service Bus depends. Throttling occurs from interactions with storage and compute resources.
Issue in a system on which Service Bus depends. For example, a given part of storage can encounter issues.
Failure of Service Bus on a single subsystem. In this situation, a compute node can get into an inconsistent state and must restart itself, causing all the entities it serves to be load balanced to other nodes. This in turn can cause a short period of slow message processing.
Failure of Service Bus within a Windows Azure Data Center. This is the classic “catastrophic failure” during which the system is unreachable for many minutes or a few hours.
Note: The term storage can refer to both Windows Azure Storage and SQL Azure.
Service Bus contains a number of mitigations for the above issues. The following sections discuss each issue and their respective mitigations.
Throttling
With Service Bus, throttling enables cooperative message rate management. Each individual Service Bus node houses many entities. Each of those entities makes demands on the system in terms of CPU, memory, storage, and other resources. When usage of any of these resources exceeds defined thresholds, Service Bus can deny a given request. The caller receives a ServerBusyException and should retry after 10 seconds.
As a mitigation, the code must read the error and halt any retries of the message for at least 10 seconds. Since the error may happen across pieces of the customer application, it is expected that each piece independently executes the retry logic. The code can reduce the probability of being throttled by enabling partitioning on a queue or topic.
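As an illustration, a minimal retry wrapper might look like the following sketch. It assumes the Service Bus .NET SDK (the Microsoft.ServiceBus.Messaging namespace); the method name, variable names, and attempt count are hypothetical choices for this example.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

// Hypothetical helper: send a message, backing off for at least 10 seconds
// whenever Service Bus throttles the request with a ServerBusyException.
static async Task SendWithBackoffAsync(QueueClient client, BrokeredMessage message, int maxAttempts = 5)
{
    for (int attempt = 1; attempt <= maxAttempts; attempt++)
    {
        try
        {
            // A BrokeredMessage can only be sent once, so clone it for each attempt.
            await client.SendAsync(message.Clone());
            return;
        }
        catch (ServerBusyException)
        {
            if (attempt == maxAttempts) throw;
            await Task.Delay(TimeSpan.FromSeconds(10)); // cooperative back-off
        }
    }
}
```

Each piece of the application that sends messages would execute this logic independently, as described above.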
Issue for a Windows Azure Dependency
Other components within Windows Azure can occasionally have service issues. For example, when a system that Service Bus uses is being upgraded, that system may temporarily have reduced capabilities. To work around these types of issues, Service Bus regularly investigates and implements mitigations, and those mitigations can have visible side effects. For example, to handle transient issues with storage, Service Bus implements a system that allows message send operations to work consistently. Due to the nature of the mitigation, a sent message can take up to 15 minutes to appear in the affected queue or subscription and be ready for a receive operation. Generally speaking, most entities will not experience this issue. However, given the number of entities in Service Bus within Windows Azure, this mitigation is sometimes needed for a small subset of Service Bus customers.
Service Bus Failure on a Single Subsystem
With any application, circumstances can cause an internal component of Service Bus to become inconsistent. When Service Bus detects this condition, it collects data from the application to aid in diagnosing what happened. Once the data is collected, the application is restarted in an attempt to return it to a consistent state. This process happens fairly quickly, and can result in an entity appearing to be unavailable for up to a few minutes, though typical downtimes are much shorter.
In these cases, the client application generates a TimeoutException or MessagingException exception. The Service Bus .NET SDK contains a mitigation for this issue in the form of automated client retry logic. Once the retry period is exhausted and the message is not delivered, you can explore using other features such as paired namespaces. Paired namespaces have other caveats which are discussed later in this document.
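The SDK's retry behavior can be tuned through the RetryPolicy property on MessagingFactory (and on individual clients created from it). The following sketch is illustrative only; the connection string is a placeholder and the back-off values are examples, not recommendations:

```csharp
using System;
using Microsoft.ServiceBus.Messaging;

// Placeholder; substitute your own namespace connection string.
string connectionString = "...";

MessagingFactory factory = MessagingFactory.CreateFromConnectionString(connectionString);

// Retry transient failures (a TimeoutException, or a MessagingException whose
// IsTransient property is true) with exponential back-off between retries.
factory.RetryPolicy = new RetryExponential(
    TimeSpan.FromSeconds(1),   // minimum back-off
    TimeSpan.FromSeconds(30),  // maximum back-off
    5);                        // maximum retry count
```

Clients created from this factory inherit the policy, so individual send and receive calls retry automatically before surfacing an exception to your code.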
Failure of Service Bus within an Azure Data Center
The most probable reason for a failure in an Azure Data Center is a failed upgrade deployment of Service Bus or a dependent system. As the platform has matured, the likelihood of this type of failure has diminished. A data center failure can also happen for reasons that include the following:
Electrical outage (power supply and generating power disappear).
Connectivity (Internet break between your clients and Windows Azure).
In both cases, a natural or man-made disaster caused the issue. To work around this and make sure that you can still send messages, you can use paired namespaces to allow messages to be sent to a second location while the primary location is made healthy again.
The paired namespaces feature supports scenarios in which a Service Bus entity or deployment within a data center becomes unavailable. While this event occurs infrequently, distributed systems still must be prepared to handle worst case scenarios. Typically, this event happens because some element on which Service Bus depends is experiencing a short-lived issue. To maintain application availability during an outage, Service Bus users can use two separate namespaces, preferably in separate data centers, to host their messaging entities. The remainder of this section uses the following terminology:
Primary namespace: The namespace your application interacts with for send and receive operations.
Secondary namespace: The namespace that acts as a backup to the primary namespace. Application logic does not interact with this namespace.
Failover interval: The amount of time to accept normal failures before the application switches from the primary namespace to the secondary namespace.
Paired namespaces support send availability. Send availability focuses on preserving the ability to send messages. To use send availability, your application must meet the following requirements:
Messages are only received from the primary namespace.
Messages sent to a given queue/topic might arrive out of order.
If your application uses sessions:
Messages within a session might arrive out of order. This is a break from the normal functionality of sessions, because the purpose of sessions is to logically group messages in order.
Session state is only maintained on the primary namespace.
The primary queue can come online and start accepting messages before the secondary queue delivers all messages into the primary queue.
This section discusses the API, how the APIs are implemented, and shows sample code that uses the feature. Note that there are billing implications associated with this feature.
The MessagingFactory.PairNamespaceAsync API
The paired namespaces feature introduces the following new method on the MessagingFactory class:
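The signature, as it appears in the Service Bus .NET SDK (reproduced here for reference; verify against your SDK version), is:

```csharp
// Defined on Microsoft.ServiceBus.Messaging.MessagingFactory.
public Task PairNamespaceAsync(PairedNamespaceOptions options);
```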
When the task completes, the namespace pairing is also complete and ready to act upon for any MessageReceiver, QueueClient, or TopicClient created with the MessagingFactory. PairedNamespaceOptions is the base class for the different types of pairing that are available with a MessagingFactory. Currently, the only derived class is one named SendAvailabilityPairedNamespaceOptions, which implements the send availability requirements. SendAvailabilityPairedNamespaceOptions has a set of constructors which all build on each other. Looking at the constructor with the most parameters, you can understand the behavior of the other constructors.
public SendAvailabilityPairedNamespaceOptions(
    NamespaceManager secondaryNamespaceManager,
    MessagingFactory messagingFactory,
    int backlogQueueCount,
    TimeSpan failoverInterval,
    bool enableSyphon)
These parameters have the following meanings:
secondaryNamespaceManager: An initialized NamespaceManager instance for the secondary namespace that the PairNamespaceAsync method can use to set up the secondary namespace. The manager will be used to obtain the list of queues in the namespace and make sure that the required backlog queues exist. If those queues do not exist, they will be created. The NamespaceManager requires the ability to create a token with the Manage claim.
messagingFactory: The MessagingFactory instance for the secondary namespace. The MessagingFactory is used to send and, if the EnableSyphon property is set to true, to receive messages from the backlog queues.
backlogQueueCount: The number of backlog queues to create. This value must be at least 1. When sending messages to the backlog, one of these queues is chosen at random. If you set the value to 1, then only one queue can ever be used. In that case, if the one backlog queue generates errors, the client is not able to try a different backlog queue and may fail to send your message. We recommend a larger value; the default is 10. You can raise or lower this value depending on how much data your application sends per day. Each backlog queue can hold up to 5 GB of messages.
failoverInterval: The amount of time during which you will accept failures on the primary namespace before switching any single entity over to the secondary namespace. Failovers occur on an entity-by-entity basis; entities in a single namespace frequently live on different nodes within Service Bus, so a failure in one entity does not imply a failure in another. You can set this value to TimeSpan.Zero to fail over to the secondary immediately after your first non-transient failure. The failures that trigger the failover timer are any MessagingException in which the IsTransient property is false, or a TimeoutException. Other exceptions, such as UnauthorizedAccessException, do not cause failover, because they indicate that the client is configured incorrectly. A ServerBusyException does not cause failover because the correct pattern is to wait 10 seconds and then send the message again.
enableSyphon: Indicates that this particular pairing should also syphon messages from the secondary namespace back to the primary namespace. In general, applications that send messages should set this value to false; applications that receive messages should set this value to true. The reason for this is that frequently, there are fewer message receivers than message senders. Depending on the number of receivers, you can choose to have a single application instance handle the syphon duties. Using many receivers has billing implications for each backlog queue.
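Putting these parameters together, a fully specified construction might look like the following sketch (the connection strings are placeholders, and the parameter values are illustrative, not recommendations):

```csharp
using System;
using Microsoft.ServiceBus;
using Microsoft.ServiceBus.Messaging;

// Placeholders; substitute your own namespace connection strings.
string primaryConnectionString = "...";
string secondaryConnectionString = "...";

MessagingFactory primary = MessagingFactory.CreateFromConnectionString(primaryConnectionString);
MessagingFactory secondary = MessagingFactory.CreateFromConnectionString(secondaryConnectionString);
NamespaceManager secondaryNamespaceManager =
    NamespaceManager.CreateFromConnectionString(secondaryConnectionString);

var options = new SendAvailabilityPairedNamespaceOptions(
    secondaryNamespaceManager,
    secondary,
    10,                        // backlogQueueCount: the default
    TimeSpan.FromMinutes(1),   // failoverInterval: tolerate failures for one minute
    false);                    // enableSyphon: set to true only in the receiving application

primary.PairNamespaceAsync(options).Wait();
```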
To use the code, create a primary MessagingFactory instance, a secondary MessagingFactory instance, a secondary NamespaceManager instance, and a SendAvailabilityPairedNamespaceOptions instance. The call can be as simple as the following:
SendAvailabilityPairedNamespaceOptions sendAvailabilityOptions =
    new SendAvailabilityPairedNamespaceOptions(secondaryNamespaceManager, secondary);
primary.PairNamespaceAsync(sendAvailabilityOptions).Wait();
When the task returned by the PairNamespaceAsync method completes, everything is set up and ready to use. Until the task completes, the background work necessary for the pairing may not have finished. As a result, you should not start sending messages until the task completes. If any failures occurred, such as bad credentials or failure to create the backlog queues, those exceptions are thrown when the task completes. Once the task completes, verify that the queues were found or created by examining the BacklogQueueCount property on your SendAvailabilityPairedNamespaceOptions instance. For the preceding code, that operation appears as follows:
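A minimal check, assuming the sendAvailabilityOptions instance from the preceding code, might look like this:

```csharp
if (sendAvailabilityOptions.BacklogQueueCount < 1)
{
    // The pairing found or created no backlog queues; treat this as a setup failure.
    throw new InvalidOperationException("Paired namespace setup failed: no backlog queues available.");
}
```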