Retry Pattern

Article
08/26/2015

Enable an application to handle anticipated, temporary failures when it attempts to connect to a service or network resource by transparently retrying an operation that has previously failed in the expectation that the cause of the failure is transient. This pattern can improve the stability of the application.

Context and Problem

An application that communicates with elements running in the cloud must be sensitive to the transient faults that can occur in this environment. Such faults include the momentary loss of network connectivity to components and services, the temporary unavailability of a service, or timeouts that arise when a service is busy.

These faults are typically self-correcting, and if the action that triggered a fault is repeated after a suitable delay it is likely to be successful. For example, a database service that is processing a large number of concurrent requests may implement a throttling strategy that temporarily rejects any further requests until its workload has eased. An application attempting to access the database may fail to connect, but if it tries again after a suitable delay it may succeed.

Solution

In the cloud, transient faults are not uncommon and an application should be designed to handle them elegantly and transparently, minimizing the effects that such faults might have on the business tasks that the application is performing.

If an application detects a failure when it attempts to send a request to a remote service, it can handle the failure by using the following strategies:

If the fault indicates that the failure is not transient or is unlikely to be successful if repeated (for example, an authentication failure caused by providing invalid credentials is unlikely to succeed no matter how many times it is attempted), the application should abort the operation and report a suitable exception.
If the specific fault reported is unusual or rare, it may have been caused by freak circumstances such as a network packet becoming corrupted while it was being transmitted. In this case, the application could retry the failing request again immediately because the same failure is unlikely to be repeated and the request will probably be successful.
If the fault is caused by one of the more commonplace connectivity or “busy” failures, the network or service may require a short period while the connectivity issues are rectified or the backlog of work is cleared. The application should wait for a suitable time before retrying the request.

For the more common transient failures, the period between retries should be chosen so as to spread requests from multiple instances of the application as evenly as possible. This can reduce the chance of a busy service continuing to be overloaded. If many instances of an application are continually bombarding a service with retry requests, it may take the service longer to recover.

If the request still fails, the application can wait for a further period and make another attempt. If necessary, this process can be repeated with increasing delays between retry attempts until some maximum number of requests have been attempted and failed. The delay time can be increased incrementally, or a timing strategy such as exponential back-off can be used, depending on the nature of the failure and the likelihood that it will be corrected during this time.

Figure 1 illustrates this pattern. If the request is unsuccessful after a predefined number of attempts, the application should treat the fault as an exception and handle it accordingly.

Figure 1 - Invoking an operation in a hosted service using the Retry pattern

The application should wrap all attempts to access a remote service in code that implements a retry policy matching one of the strategies listed above. Requests sent to different services can be subject to different policies, and some vendors provide libraries that encapsulate this approach. These libraries typically implement policies that are parameterized, and the application developer can specify values for items such as the number of retries and the time between retry attempts.

The code in an application that detects faults and retries failing operations should log the details of these failures. This information may be useful to operators. If a service is frequently reported as unavailable or busy, it is often because the service has exhausted its resources. You may be able to reduce the frequency with which these faults occur by scaling out the service. For example, if a database service is continually overloaded, it may be beneficial to partition the database and spread the load across multiple servers.

Note

Microsoft Azure provides extensive support for the Retry pattern. The patterns & practices Transient Fault Handling Block enables an application to handle transient faults in many Azure services using a range of retry strategies. The Microsoft Entity Framework version 6 provides facilities for retrying database operations. Additionally, many of the Azure Service Bus and Azure Storage APIs implement retry logic transparently.

Issues and Considerations

You should consider the following points when deciding how to implement this pattern:

The retry policy should be tuned to match the business requirements of the application and the nature of the failure. It may be better for some noncritical operations to fail fast rather than retry several times and impact the throughput of the application. For example, in an interactive web application that attempts to access a remote service, it may be better to fail after a smaller number of retries with only a short delay between retry attempts, and display a suitable message to the user (for example, “please try again later”) to prevent the application from becoming unresponsive. For a batch application, it may be more appropriate to increase the number of retry attempts with an exponentially increasing delay between attempts.
A highly aggressive retry policy with minimal delay between attempts, and a large number of retries, could further degrade a busy service that is running close to or at capacity. This retry policy could also affect the responsiveness of the application if it is continually attempting to perform a failing operation rather than doing useful work.
If a request still fails after a significant number of retries, it may be better for the application to prevent further requests going to the same resource for a period and simply report a failure immediately. When the period expires, the application may tentatively allow one or more requests through to see whether they are successful. For more details of this strategy, see the Circuit Breaker pattern.
The operations in a service that are invoked by an application that implements a retry policy may need to be idempotent. For example, a request sent to a service may be received and processed successfully but, due to a transient fault, it may be unable to send a response indicating that the processing has completed. The retry logic in the application might then attempt to repeat the request on the assumption that the first request was not received.
A request to a service may fail for a variety of reasons and raise different exceptions, depending on the nature of the failure. Some exceptions may indicate a failure that could be resolved very quickly, while others may indicate that the failure is longer lasting. It may be beneficial for the retry policy to adjust the time between retry attempts based on the type of the exception.
Consider how retrying an operation that is part of a transaction will affect the overall transaction consistency. It may be useful to fine tune the retry policy for transactional operations to maximize the chance of success and reduce the need to undo all the transaction steps.
Ensure that all retry code is fully tested against a variety of failure conditions. Check that it does not severely impact the performance or reliability of the application, cause excessive load on services and resources, or generate race conditions or bottlenecks.
Implement retry logic only where the full context of a failing operation is understood. For example, if a task that contains a retry policy invokes another task that also contains a retry policy, this extra layer of retries can add long delays to the processing. It may be better to configure the lower-level task to fail fast and report the reason for the failure back to the task that invoked it. This higher-level task can then decide how to handle the failure based on its own policy.
It is important to log all connectivity failures that prompt a retry so that underlying problems with the application, services, or resources can be identified.
Investigate the faults that are most likely to occur for a service or a resource to discover if they are likely to be long lasting or terminal. If this is the case, it may be better to handle the fault as an exception. The application can report or log the exception, and then attempt to continue either by invoking an alternative service (if there is one available), or by offering degraded functionality. For more information on how to detect and handle long-lasting faults, see the Circuit Breaker pattern.

When to Use this Pattern

Use this pattern:

When an application could experience transient faults as it interacts with a remote service or accesses a remote resource. These faults are expected to be short lived, and repeating a request that has previously failed could succeed on a subsequent attempt.

This pattern might not be suitable:

When a fault is likely to be long lasting, because this can affect the responsiveness of an application. The application may simply be wasting time and resources attempting to repeat a request that is most likely to fail.
For handling failures that are not due to transient faults, such as internal exceptions caused by errors in the business logic of an application.
As an alternative to addressing scalability issues in a system. If an application experiences frequent “busy” faults, it is often an indication that the service or resource being accessed should be scaled up.

Example

This example illustrates an implementation of the Retry pattern. The OperationWithBasicRetryAsync method, shown below, invokes an external service asynchronously through the TransientOperationAsync method (the details of this method will be specific to the service and are omitted from the sample code).

private int retryCount = 3;...public async Task OperationWithBasicRetryAsync(){  int currentRetry = 0;  for (; ;)  {    try    {      // Calling external service.      await TransientOperationAsync();                          // Return or break.      break;    }    catch (Exception ex)    {      Trace.TraceError("Operation Exception");      currentRetry++;      // Check if the exception thrown was a transient exception      // based on the logic in the error detection strategy.      // Determine whether to retry the operation, as well as how       // long to wait, based on the retry strategy.      if (currentRetry > this.retryCount || !IsTransient(ex))      {        // If this is not a transient error         // or we should not retry re-throw the exception.         throw;      }    }    // Wait to retry the operation.    // Consider calculating an exponential delay here and     // using a strategy best suited for the operation and fault.    Await.Task.Delay();  }}// Async method that wraps a call to a remote service (details not shown).private async Task TransientOperationAsync(){  ...}

The statement that invokes this method is encapsulated within a try/catch block wrapped in a for loop. The for loop exits if the call to the TransientOperationAsync method succeeds without throwing an exception. If the TransientOperationAsync method fails, the catch block examines the reason for the failure, and if it is deemed to be a transient error the code waits for a short delay before retrying the operation.

The for loop also tracks the number of times that the operation has been attempted, and if the code fails three times the exception is assumed to be more long lasting. If the exception is not transient or it is longlasting, the catch handler throws an exception. This exception exits the for loop and should be caught by the code that invokes the OperationWithBasicRetryAsync method.

The IsTransient method, shown below, checks for a specific set of exceptions that are relevant to the environment in which the code is run. The definition of a transient exception may vary according to the resources being accessed and the environment in which the operation is being performed.

private bool IsTransient(Exception ex){  // Determine if the exception is transient.  // In some cases this may be as simple as checking the exception type, in other   // cases it may be necessary to inspect other properties of the exception.  if (ex is OperationTransientException)    return true;  var webException = ex as WebException;  if (webException != null)  {    // If the web exception contains one of the following status values     // it may be transient.    return new[] {WebExceptionStatus.ConnectionClosed,                   WebExceptionStatus.Timeout,                   WebExceptionStatus.RequestCanceled }.            Contains(webException.Status);  }  // Additional exception checking logic goes here.  return false;}

The following pattern may also be relevant when implementing this pattern:

Circuit Breaker Pattern. The Retry pattern is ideally suited to handling transient faults. If a failure is expected to be more long lasting, it may be more appropriate to implement the Circuit Breaker Pattern. The Retry pattern can also be used in conjunction with a circuit breaker to provide a comprehensive approach to handling faults.

More Information

The Transient Fault Handling Application Block on MSDN.
The article Connection Resiliency / Retry Logic (EF6 onwards) on MSDN.

Next Topic | Previous Topic | Home | Community