
Failsafe: Guidance for Resilient Cloud Architectures

Updated: June 25, 2014

Authors: Marc Mercuri, Ulrich Homann, Mark Simms, Jason Wescott and Andrew Townhill

Reviewers: Michael Thomassy, Curt Peterson, Stuart Ozer, Fran Dougherty, Tim O’Brien, Hatay Tuna, Patrick Butler Monterde, Mark Kottke, Lewis Curtis, William Bellamy

Publish date: June 2014

Introduction

Fail-safe noun. Something designed to work or function automatically to prevent breakdown of a mechanism, system, or the like.

Individuals, whether in the context of employee, citizen, or consumer, demand instant access to application, compute, and data services. The number of people connected and the devices they use to connect to these services are ever growing. In this world of always-on services, the systems that support them must be designed to be both available and resilient.

These services will be deployed in cloud environments populated with commodity hardware in pre-defined configurations. Historically, you may have purchased higher-end hardware to scale up; in a cloud environment, you scale out instead. Costs for these cloud environments are kept low through the use of commodity hardware, but commodity hardware will fail, and the cloud requires the architecture to truly embrace failure. Historically, you may have focused on preventing failures and optimizing “Mean Time Between Failures”; in this new environment, the focus shifts to “Mean Time to Restore with Minimal Impact.”

It is highly likely that the services you develop will be composites, composed of one or more first-party or third-party platform and provider services. These services will be built on cloud infrastructure that will fail, and the architecture must assume that the services they consume will fail as well. As with the infrastructure architecture, the application architecture design must embrace failure.

The Fail-Safe initiative within Microsoft is intended to deliver general guidance for building resilient cloud architectures, guidance for implementing those architectures on Microsoft technologies, and recipes for implementing these architectures for specific scenarios. The authors of this document are members of Microsoft’s Cloud + Enterprise division, Trustworthy Computing and Microsoft Consulting Services.

This document focuses on the architectural considerations for designing scalable and resilient systems.

This paper is organized into the following sections:

  • Decompose the Application by Workload: Defining how a workload-centric approach provides better controls over costs, more flexibility in choosing technologies best suited to the workload, and enables a more finely tuned approach to availability and resiliency.

  • Establish a Lifecycle Model: Establishing an application lifecycle model helps define the expected behavior of an application in production and will provide requirements and insight for the overall architecture.

  • Establish an Availability Model and Plan: The availability model identifies the level of availability that is expected for your workload. It is critical as it will inform many of the decisions you’ll make when establishing your service.

  • Identify Failure Points, Failure Modes, and Failure Effects: To create a resilient architecture, it’s important to understand and identify failure points and modes. Specifically, making a proactive effort to understand and document what can cause an outage will establish an outline that can be used in analysis and planning.

  • Resiliency Patterns and Considerations: This section represents the majority of the document, and contains key considerations across compute, storage, and platform services. These considerations focus on proven practices for delivering a healthy application.

  • Design for Operations: In a world that expects services to be “always on”, it’s important that services be designed for operations. This section looks at proven practices for designing for operations across the lifecycle, from establishing a health model, to implementing telemetry, to visualizing that telemetry information for operations and developer audiences.

Decompose the Application by Workload

Applications are typically composed of multiple workloads.

Different workloads can, and often do, have different requirements, different levels of criticality to the business, and different levels of financial consideration associated with them. By decomposing an application into workloads, an organization provides itself with valuable flexibility. A workload-centric approach provides better controls over costs, more flexibility in choosing technologies best suited to the workload, workload-specific approaches to availability and security, and flexibility and agility in adding and deploying new capabilities.

Scenarios

When thinking about resiliency, it’s sometimes helpful to do so in the context of scenarios. The following are examples of typical scenarios:

  • Sports Data Service

    A customer offers a data service that provides sports information. The service has two primary workloads. The first provides statistics for players and teams. The second provides scores and commentary for games that are currently in progress.

  • E-Commerce Web Site

    An online retailer sells goods via a website in a well-established model. The application has a number of workloads, with the most popular being “search and browse” and “checkout.”

  • Social

    A high profile social site allows members of a community to engage in shared experiences around forums, user generated content, and casual gaming. The application has a number of workloads, including registration, search and browse, social interaction, gaming, email, etc.

  • Web

    An organization wishes to provide an experience to customers via its web site. The application needs to deliver experiences on both PC-based browsers as well as popular mobile device types (phone, tablet). The application has a number of workloads including registration, search and browse, content publishing, social commenting, moderation, gaming, etc.

Example of Decomposing by Workload

Let’s take a closer look at one of the scenarios and decompose it into its child workloads. An e-commerce website could have a number of workloads: browse & search, checkout & management, user registration, user-generated content (reviews and ratings), personalization, etc.

Example definitions of two of the core workloads for the scenario would be:

  • Browse & Search enables customers to navigate through a product catalog, search for specific items, and perhaps manage baskets or wish lists. This workload can have attributes such as anonymous user access, sub-second response times, and caching. Performance degradation may occur in the form of increased response times with unexpected user load or application-tolerant interrupts for product inventory refreshes. In those cases, the application may choose to continue to serve information from the cache.

  • Checkout & Management helps customers place, track, and cancel orders; select delivery methods and payment options; and manage profiles. This workload can have attributes such as secure access, queued processing, access to third-party payment gateways, and connectivity to back-end on-premise systems. While the application may tolerate increased response time, it may not tolerate loss of orders; therefore, it is designed to guarantee that customer orders are always accepted and captured, regardless of whether the application can process the payment or arrange delivery.

Establish a Lifecycle Model

An application lifecycle model defines the expected behavior of an application when operational. At different phases and times, an application will put different demands on the system whether at a functional or scale level. The lifecycle model(s) will reflect this.

Workloads should have defined lifecycle models for all relevant and applicable scenarios at appropriate levels of granularity. Services may have hourly, daily, weekly, or seasonal lifecycle differences that, when modeled, identify specific capacity, availability, performance, and scalability requirements over time.


Figure 1. Lifecycle examples across different industries and scenarios

There will often be periods of time that have their own unique lifecycle such as:

  • A spike related to peak demand during a holiday period.

  • Increased filing of tax returns just before their due date.

  • Morning and afternoon commuter time windows.

  • The end-of-year filing of employee performance reviews.

Many organizations have an understanding of these types of scenarios and the related scenario-specific lifecycles.

Decomposition allows you to have different internal Service Level Agreements (SLAs) at the workload level. An example of this is the Sports Data API referenced earlier, which had a target SLA of 99.99%. You can split that API into two workloads: “Live Scores + Commentary” and “Team, Player, and League Statistics.”

For the “Live Scores + Commentary” workload, the lifecycle has an “off and on” pattern. However, the availability of “Team, Player, and League Statistics” will be constant. Decomposition by workload allows you to have SLAs tailored to the availability needs of the aggregated workload of the composite service.


Figure 2

Establish an Availability Model and Plan

Once a lifecycle model is identified, the next step is to establish an availability model and plan. An availability model for your application identifies the level of availability that is expected for your workload. It is critical as it will inform many of the decisions you’ll make when establishing your service.

There are a number of things to consider and a number of potential actions that can be taken.

SLA Identification

When developing your availability plan, it’s important to understand what the desired availability is for your application, the workloads within that application, and the services that are utilized in the delivery of those workloads.

Defining the Desired SLA for Your Workload

Understanding the lifecycle of your workload will help you choose the SLA that you want to deliver. An SLA may not be provided for your service publicly. However, your architecture should target an availability baseline that you will aspire to meet.

Depending on the type of solution you are building, there will be a number of considerations and options for delivering higher availability. Commercial service providers do not offer 100% SLAs because the complexity and cost to deliver that level of SLA is unfeasible or unprofitable. Decomposing your application to the workload level allows you to make decisions and implement approaches for availability. Delivering 99.99% uptime for your entire application may be unfeasible, but for a workload in an application, it is achievable.

Even at the workload level, you may not choose to implement every option. What you choose to implement or not is determined by your requirements. Regardless of the options you do choose, you should make a conscious choice that is informed by and considerate of all of the options.

Autonomy

Autonomy is about independence and reducing dependency between the parts that make up the service as a whole. Dependency on components, data, and external entities must be examined when designing services, with an eye toward building related functionality into autonomous units within the service. Doing so provides the agility to update versions of distinct autonomous units, finer-grained control over scaling those autonomous units, etc.

Workload architectures are often composed of autonomous components that do not rely on manual intervention, and do not fail when the entities they depend upon are not available. Applications composed of autonomous parts are:

  • available and operational

  • resilient and easily fault-recoverable

  • lower-risk for unhealthy failure states

  • easy to scale through replication

  • less likely to require manual interventions

These autonomous units will often leverage asynchronous communication, pull-based data processing, and automation to ensure continuous service.

Looking forward, the market will evolve to a point where there are standardized interfaces for certain types of functionality for both vertical and horizontal scenarios. When this future vision is realized, a service provider will be able to engage with different providers and potentially different implementations that solve the designated work of the autonomous unit. For continuous services, this will be done autonomously and be based on policies.

As much as autonomy is an aspiration, most services will take a dependency on a third party service – if only for hosting. It’s imperative to understand the SLAs of these dependent services and incorporate them into your availability plan.

Understanding the SLAs and Resiliency Options for Service Dependencies

This section identifies the different types of SLAs that can be relevant to your service. For each of these service types, there are key considerations and approaches, as well as questions that should be asked.

Public Cloud Platform Services

Services provided by a commercial cloud computing platform, such as compute or storage, have service level agreements that are designed to accommodate a multitude of customers at significant scale. As such, the SLAs for these services are non-negotiable. A provider may offer tiered levels of service with different SLAs to deliver a predictable quality of service, but those tiers are also non-negotiable.

Questions to consider for this type of service:

  • Does this service allow only a certain number of calls to the Service API?

  • Does this service place limits on the call frequency to the Service API?

  • Does the service limit the number of servers that can call the Service API?

  • What is the publicly available information on how the service delivers on its availability promise?

  • How does this service communicate its health status?

  • What is the stated Service Level Agreement (SLA)?

  • What are the equivalent platform services provided by other 3rd parties?

3rd Party “Free” Services

Many third parties provide “free” services to the community. For private sector organizations, this is largely done to help generate an ecosystem of applications around their core product or service. For the public sector, this is done to provide data to citizens and businesses that have ostensibly paid for its collection through taxes.

Most of these services will not come with service level agreements, so availability is not guaranteed. When SLAs are provided, they typically focus on restrictions that are placed on consuming applications and mechanisms that will be used to enforce them. Examples of restrictions can include throttling or blacklisting your solution if it exceeds a certain number of service calls, exceeds a certain number of calls in a given time period (x per minute), or exceeds the number of allowable servers that are calling the service.

Questions to consider for this type of service:

  • Does this service allow only a certain number of calls to the Service API?

  • Does this service place limits on the call frequency to the Service API?

  • Does the service limit the number of servers that can call the Service API?

  • What is the publicly available information on how the service delivers on its availability promise?

  • How does this service communicate its health status?

  • What is the stated Service Level Agreement (SLA)?

  • Is this a commodity service where the required functionality and/or data are available from multiple service providers?

  • If a commodity service, is the interface interoperable across other service providers (directly or through an available abstraction layer)?

  • What are the equivalent platform services provided by other 3rd parties?

3rd Party Commercial Services

Commercial services provided by third parties have service level agreements that are designed to accommodate the needs of paying customers. A provider may provide tiered levels of SLAs with different levels of availability, but these SLAs will be non-negotiable.

Questions to consider for this type of service:

  • Does this service allow only a certain number of calls to the Service API?

  • Does this service place limits on the call frequency to the Service API?

  • Does the service limit the number of servers that can call the Service API?

  • What is the publicly available information on how the service delivers on its availability promise?

  • How does this service communicate its health status?

  • What is the stated Service Level Agreement (SLA)?

  • Is this a commodity service where the required functionality and/or data are available from multiple service providers?

  • If a commodity service, is the interface interoperable across other service providers (directly or through an available abstraction layer)?

  • What are the equivalent platform services provided by other 3rd parties?

Community Cloud Services

A community of organizations, such as a supply chain, may make services available to member organizations.

Questions to consider for this type of service:

  • Does this service allow only a certain number of calls to the Service API?

  • Does this service place limits on the call frequency to the Service API?

  • Does the service limit the number of servers that can call the Service API?

  • What is the publicly available information on how the service delivers on its availability promise?

  • How does this service communicate its health status?

  • What is the stated Service Level Agreement (SLA)?

  • As a member of the community, is there a possibility of negotiating a different SLA?

  • Is this a commodity service where the required functionality and/or data are available from multiple service providers?

  • If a commodity service, is the interface interoperable across other service providers (directly or through an available abstraction layer)?

  • What are the equivalent platform services provided by other 3rd parties?

1st Party Internal Enterprise Wide Cloud Services

An enterprise may make core services, such as stock price data or product metadata, available to its divisions and departments.

Questions to consider for this type of service:

  • Does this service allow only a certain number of calls to the Service API?

  • Does this service place limits on the call frequency to the Service API?

  • Does the service limit the number of servers that can call the Service API?

  • What is the publicly available information on how the service delivers on its availability promise?

  • How does this service communicate its health status?

  • What is the stated Service Level Agreement (SLA)?

  • As a member of the organization, is there a possibility of negotiating a different SLA?

  • Is this a commodity service where the required functionality and/or data are available from multiple service providers?

  • If a commodity service, is the interface interoperable across other service providers (directly or through an available abstraction layer)?

  • What are the equivalent platform services provided by other 3rd parties?

1st Party Internal Divisional or Departmental Cloud Services

An enterprise division or department may make services available to other members of their immediate organization.

Questions to consider for this type of service:

  • Does this service allow only a certain number of calls to the Service API?

  • Does this service place limits on the call frequency to the Service API?

  • Does the service limit the number of servers that can call the Service API?

  • What is the publicly available information on how the service delivers on its availability promise?

  • How does this service communicate its health status?

  • What is the stated Service Level Agreement (SLA)?

  • As a member of the division, is there a possibility of negotiating a different SLA?

  • Is this a commodity service where the required functionality and/or data are available from multiple service providers?

  • If a commodity service, is the interface interoperable across other service providers (directly or through an available abstraction layer)?

  • What are the equivalent platform services provided by other 3rd parties?

The “True 9s” of Composite Service Availability

Taking advantage of existing services can provide significant agility in delivering solutions for your organization or for commercial sale. While attractive, it is important to truly understand the impacts these dependencies have on the overall SLA for the workload.

Availability is typically expressed as a percentage of uptime in a given year. This availability percentage is referred to by its number of “9s.” For example, 99.9% represents a service with “three nines,” and 99.999% represents a service with “five nines.”

 

Availability %            | Downtime per year | Downtime per month | Downtime per week
--------------------------|-------------------|--------------------|------------------
90% ("one nine")          | 36.5 days         | 72 hours           | 16.8 hours
95%                       | 18.25 days        | 36 hours           | 8.4 hours
97%                       | 10.96 days        | 21.6 hours         | 5.04 hours
98%                       | 7.30 days         | 14.4 hours         | 3.36 hours
99% ("two nines")         | 3.65 days         | 7.20 hours         | 1.68 hours
99.5%                     | 1.83 days         | 3.60 hours         | 50.4 minutes
99.8%                     | 17.52 hours       | 86.23 minutes      | 20.16 minutes
99.9% ("three nines")     | 8.76 hours        | 43.2 minutes       | 10.1 minutes
99.95%                    | 4.38 hours        | 21.56 minutes      | 5.04 minutes
99.99% ("four nines")     | 52.56 minutes     | 4.32 minutes       | 1.01 minutes
99.999% ("five nines")    | 5.26 minutes      | 25.9 seconds       | 6.05 seconds
99.9999% ("six nines")    | 31.5 seconds      | 2.59 seconds       | 0.605 seconds
99.99999% ("seven nines") | 3.15 seconds      | 0.259 seconds      | 0.0605 seconds

One common misconception relates to the number of “9s” a composite service provides. Specifically, it is often assumed that if a given service is composed of 5 services, each with a promised 99.99% uptime in their SLAs, the resulting composite service has an availability of 99.99%. This is not the case.


Figure 3: Downtime related to the more common “9s”

The composite SLA is actually a calculation which considers the amount of downtime per year. A service with an SLA of “four 9s” (99.99%) can be offline up to 52.56 minutes.

Incorporating 5 services with a 99.99% SLA into a composite introduces an identified SLA risk of 262.8 minutes or 4.38 hours. This reduces the availability to 99.95% before a single line of code is written!
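
To make the arithmetic concrete, the following is a minimal sketch in Python of the worst-case calculation described above; it assumes the five dependent services fail independently and that their downtime windows do not overlap.

```python
# A minimal sketch of the composite-availability arithmetic described above.
# Worst case: the five services fail independently and their downtime
# windows do not overlap.

MINUTES_PER_YEAR = 365 * 24 * 60                     # 525,600 minutes

single_sla = 0.9999                                  # "four nines"
downtime_per_service = (1 - single_sla) * MINUTES_PER_YEAR
print(round(downtime_per_service, 2))                # ~52.56 minutes

services = 5
composite_downtime = services * downtime_per_service
print(round(composite_downtime, 1))                  # ~262.8 minutes (~4.38 hours)

composite_availability = 1 - composite_downtime / MINUTES_PER_YEAR
print(round(composite_availability * 100, 2))        # ~99.95%
```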

It is true that, generally, you cannot change the availability of a third-party service. However, when writing your code, you can increase the overall availability of your application using techniques like the ones described in this document.

Note:
Some services may provide different tiers of service for different price points or based on business partnerships.

Consider the Sports Data API we previously referenced. Games are only played on certain days and only for a certain number of hours on those days. During those periods, the “Live Scores + Commentary” workload would need a 100% SLA; when no games are scheduled, that workload has a 0% SLA. The “Team, Player, and League Statistics” workload has a more consistent lifecycle pattern, and there is a requirement for that workload to have a 99% SLA at all times.


Figure 4

When leveraging external services, we cannot stress enough the importance of understanding each SLA individually and its impact on the composite.

Identify Failure Points, Failure Modes, and Failure Effects

To create a resilient architecture, it’s important to understand it; specifically, to make a proactive effort to understand and document what can cause an outage.

Understanding the failure points and failure modes for an application and its related workload services can enable you to make informed, targeted decisions on strategies for resiliency and availability.

Failure Points

Failure points are locations where failures may result in a service interruption. An important focus is on design elements that are subject to external change.

Examples of failure points include:

  • Database connections

  • Website connections

  • Configuration files

  • Registry keys

Categories of common failure points include:

  • ACLs

  • Database access

  • External web site/service access

  • Transactions

  • Configuration

  • Capacity

  • Network

Failure Modes

While failure points define the areas that can result in an outage, failure modes identify the specific manner in which a failure can occur.

Examples of failure modes include:

  • A missing configuration file

  • Significant traffic exceeding resource capacity

  • A database reaching maximum capacity

Failure Effects

Failure effects are the consequences of failure on functionality. Learn about the effects of failure and the frequency at which these types of failures are likely to occur. By doing so, you can prioritize when and how you address the failure points and failure modes of your application or service.

Resiliency Patterns and Considerations

This document will look at key considerations across compute, storage, and platform services. Before covering these topics, it is important to recap several basic resiliency-related practices that are often misunderstood and/or not implemented.

Default to Asynchronous

As mentioned previously, a resilient architecture should optimize for autonomy. One of the ways to achieve autonomy is by making communication asynchronous. A resilient architecture should default to asynchronous interaction, with synchronous interactions happening only as the result of an exception.

Stateless web tiers or web tiers with a distributed cache can provide this on the front end of a solution. Queues can provide this capability for interaction between workload services or between services within a workload service.

The latter allows messages to be placed on queues so that secondary services can retrieve them, based on business logic, time, or queue volume. In addition to making the process asynchronous, this also allows the tiers “pushing” to or “pulling” from the queues to be scaled as appropriate. A minimal sketch of this pattern follows.
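
As an illustration only, the following Python sketch uses the standard-library queue module to show a front end placing work on a queue and a worker pulling from it asynchronously; in a real deployment the in-memory queue would be a durable, distributed queue service, and the names used here are hypothetical.

```python
import queue
import threading

# A minimal, illustrative sketch of queue-based asynchronous interaction.
# In production the in-memory queue below would be replaced by a durable,
# distributed queue service offered by the cloud platform.

work_queue: "queue.Queue[str]" = queue.Queue()

def front_end(order_id: str) -> None:
    """Accept the request and return immediately; processing happens later."""
    work_queue.put(order_id)

def worker() -> None:
    """Pull messages and process them independently of the front end."""
    while True:
        order_id = work_queue.get()
        try:
            print(f"processing order {order_id}")
        finally:
            work_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
front_end("order-42")
work_queue.join()
```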

Time-outs

A common area where transient faults will occur is where your architecture connects to a service or a resource such as a database. When consuming these services, it’s a common practice to implement logic that introduces the concept of a time-out. This logic identifies an acceptable timeframe in which a response is expected and generates an identifiable error when that timeframe is exceeded. When a time-out error appears, appropriate steps are taken based on the context in which the error occurs. Context can include the number of times this error has occurred, the potential impact of the unavailable resource, SLA guarantees for the current time period for the given customer, etc. A minimal time-out sketch follows.
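
The following is a minimal sketch of such time-out logic, assuming an HTTP dependency; the URL, the 2-second budget, and the DependencyTimeout error type are hypothetical choices, not a prescribed implementation.

```python
import socket
import urllib.request
from urllib.error import URLError

# A minimal, illustrative sketch of a time-out around a dependency call.
# The endpoint and the 2-second budget are hypothetical values.

class DependencyTimeout(Exception):
    """Identifiable error raised when the dependency exceeds its time budget."""

def fetch_with_timeout(url: str, timeout_seconds: float = 2.0) -> bytes:
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.read()
    except (socket.timeout, TimeoutError) as exc:
        # Surface a distinct error so callers can react based on context
        # (failure counts, impact of the resource, SLA for the caller, etc.).
        raise DependencyTimeout(f"{url} exceeded {timeout_seconds}s") from exc
    except URLError as exc:
        # urllib wraps connection timeouts in URLError in some cases.
        if isinstance(exc.reason, socket.timeout):
            raise DependencyTimeout(f"{url} exceeded {timeout_seconds}s") from exc
        raise
```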

Handle Transient Faults

When designing the service(s) that will deliver your workload, you must accept and embrace that failures will occur and take the appropriate steps to address them.

One of the common areas to address is transient faults. As no service has 100% uptime, it’s realistic to expect that you may not be able to connect to a service that a workload has taken a dependency on. The inability to connect to or faults seen from one of these services may be fleeting (less than a second) or permanent (a provider shuts down).

Degrade Gracefully

Your workload service should aspire to handle these faults gracefully. During an outage at its cloud provider, for example, Netflix served customers an older copy of their video queue when the primary data store was not available. Another example would be an ecommerce site continuing to collect orders when its payment gateway is unavailable. This provides the ability to process the orders when the payment gateway is once again available or after failing over to a secondary payment gateway.

High Level Considerations for Graceful Degradation

When thinking about how to gracefully degrade, the following are key considerations:

  • Components that do not have the full context of the request should simply fail and pass on the exception message. To resolve this, implement a circuit breaker, described later in this document, to fail fast when you reach failure count thresholds.

  • Failures that may be transient in nature are common. Handle them using the Retry pattern described later in this document.

  • The caller may be able to correct some failed requests by retrying the request using different parameters or a different path.

  • If the success of the request is not critical to the scenario, handle the failure using simple omission.

  • You can handle failures by returning a success message and queuing the failed requests for processing later when the resource returns to a normal state.

  • You may be able to return something that was previously correct, for example, cached credentials, stale data from a cache, etc. (a cache-fallback sketch follows this list).

  • You can handle some failures by providing an experience that is nearly correct, for example, temporary access, approximated values, best guess, noscript, etc.

  • Rather than throw an error, you may be able to provide some alternative, for example, random values, anonymous rights, default images, etc.
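
As a hedged illustration of the “stale data from a cache” option above, the sketch below returns the last known good value when the live call fails; the cache shape and the load_live callable are hypothetical stand-ins for your own data access code.

```python
import time
from typing import Any, Callable, Dict, Tuple

# A minimal sketch of graceful degradation by serving stale cached data.
# `load_live` stands in for whatever call normally produces the value.

_cache: Dict[str, Tuple[float, Any]] = {}   # key -> (timestamp, value)

def get_with_stale_fallback(key: str, load_live: Callable[[], Any]) -> Any:
    try:
        value = load_live()
        _cache[key] = (time.time(), value)   # refresh the last known good value
        return value
    except Exception:
        if key in _cache:
            _, stale_value = _cache[key]
            return stale_value               # degrade gracefully: stale but usable
        raise                                # nothing cached; surface the failure
```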

Transient Fault Handling Considerations

There are several key considerations for the implementation of transient fault handling, as detailed in the following sections.

  • Retry logic

    The simplest form of transient fault handling is to retry the operation that failed. When consuming a commercial third-party service, implementing “retry logic” will often resolve transient issues.

    It should be noted that designs should typically limit the number of times the logic will be retried. The logic will typically attempt to execute the action(s) a certain number of times, registering an error and/or utilizing a secondary service or workflow if the fault continues.

  • Exponential Backoff

    If the result of the transient fault is due to throttling by the service due to heavy load, repeated attempts to call the service will only extend the throttling and impact overall availability.

    It is often desirable to reduce the volume of calls to the service to help avoid or reduce throttling. This is typically done algorithmically, such as retrying immediately after the first failure, waiting 1 second after the second failure, 5 seconds after the third failure, and so on, until the call ultimately succeeds or an application-defined failure threshold is hit.

    This approach is referred to as “exponential backoff.” A minimal retry-with-backoff sketch appears after this list.

  • Idempotency

    A core assumption with connected services is that they will not be 100% available and that transient fault handling with retry logic is a core implementation approach. In cases where retry logic is implemented, there is the potential for the same message to be sent more than once, for messages to be sent out of sequence, etc.

    Operations should be designed to be idempotent, ensuring that sending the same message multiple times does not result in an unexpected or polluted data store.

    For example, inserting data from all requests may result in multiple records being added if the service operation is called multiple times. An alternate approach is to implement the code as an intelligent “upsert.” A timestamp or global identifier can be used to distinguish new messages from previously processed ones, inserting only newer ones into the database and updating existing records if the message is newer than what was received in the past (an idempotent upsert sketch appears after this list).

  • Compensating Behavior

    In addition to idempotency, another area for consideration is the concept of compensating behavior. In a world of an ever-growing set of connected systems and the emergence of composite services, understanding how to handle compensating behavior is important.

    For many developers of line-of-business applications, the concepts of transactions are not new, but the frame of reference is often tied to the transactional functionality exposed by local data technologies and related code libraries. When looking at the concept in terms of the cloud, this mindset needs to take into account new considerations related to the orchestration of distributed services.

    A service orchestration can span multiple distributed systems and be long running and stateful. The orchestration itself is rarely synchronous and can run from seconds to years, based on the business scenario.

    In a supply chain scenario that could tie together 25 organizations in the same workload activity, for example, there may be a set of 25 or more systems that are interconnected in one or more service orchestrations.

    If success occurs, the 25 systems must be made aware that the activity was successful. For each connection point in the activity, participant systems can provide a correlation ID for the messages they receive from other systems. Depending on the type of activity, the receipt of that correlation ID may satisfy the party that the transaction is notionally complete. In other cases, upon the completion of the interactions of all 25 parties, a confirmation message may be sent to all parties (either directly from a single service or via the specific orchestration interaction points for each system).

    To handle failures in composite and/or distributed activities, each service would expose a service interface and operation(s) to receive requests to cancel a given transaction by a unique identifier. Behind the service façade, workflows would be in place to compensate for the cancellation of this activity. Ideally these would be automated procedures, but they can be as simple as routing to a person in the organization to remediate manually.
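
The following is a minimal sketch of retry logic with exponential backoff, as referenced in the list above; the attempt limit, base delay, and jitter are illustrative values, not recommendations.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# A minimal sketch of retry logic with exponential backoff.
# The attempt limit and base delay are illustrative values only.

def retry_with_backoff(operation: Callable[[], T],
                       max_attempts: int = 5,
                       base_delay_seconds: float = 1.0) -> T:
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise                      # threshold reached; surface the fault
            # Exponential backoff with a little jitter to avoid synchronized retries.
            delay = base_delay_seconds * (2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, 0.1 * delay))
    raise RuntimeError("unreachable")
```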
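
And a minimal sketch of the idempotent “upsert” approach referenced above, keyed on a message identifier and timestamp; the in-memory dictionary is a hypothetical stand-in for a real data store.

```python
from typing import Dict, Tuple

# A minimal sketch of an idempotent upsert keyed on a message identifier.
# A real implementation would target a database; the dict stands in for it.

_store: Dict[str, Tuple[float, dict]] = {}   # message_id -> (timestamp, payload)

def upsert(message_id: str, timestamp: float, payload: dict) -> None:
    existing = _store.get(message_id)
    if existing is None:
        _store[message_id] = (timestamp, payload)   # first time seen: insert
    elif timestamp > existing[0]:
        _store[message_id] = (timestamp, payload)   # newer version: update
    # Older or duplicate messages are ignored, so redelivery is harmless.

# Processing the same message twice leaves the store unchanged.
upsert("msg-1", 100.0, {"qty": 2})
upsert("msg-1", 100.0, {"qty": 2})
assert len(_store) == 1
```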

Circuit Breaker Pattern

A circuit breaker is a switch that automatically interrupts the flow of electric current if the current exceeds a preset limit. Circuit breakers are used most often as a safety precaution where excessive current through a circuit could be hazardous. Unlike a fuse, a circuit breaker can be reset and re-used.

The same pattern is applicable to software design, and particularly applicable for services where availability and resiliency are a key consideration.

When a resource is unavailable, implementing a software circuit breaker allows the system to respond with the appropriate action.

A circuit breaker will have three states: Closed, Open, and Half-open.

The Closed state is the normal state for the application or service. When in the Closed state, requests are routed through the standard path. A counter keeps track of failure rates and evaluates them against a threshold. If the failure rate crosses the threshold, the circuit breaker changes to the Open state.

When in an Open state, the circuit breaker routes requests to the mitigation path(s). This could be a fail fast or some other form of graceful degradation. When switching to the Open state, the circuit breaker will also initiate a timer. When the timer expires, it will switch to a Half-open state.

The Half-open state will begin to route a limited number of requests through the normal path. If these are successful, the circuit breaker switches to the Closed state. If this limited number of requests fails, the circuit breaker returns to the Open state.
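
A minimal sketch of these three states follows; the failure threshold, the open-state timer, and the fallback hook are hypothetical choices rather than a prescribed implementation.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# A minimal sketch of the Closed / Open / Half-open states described above.
# The failure threshold and open duration are illustrative values only.

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, open_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.state = "closed"
        self.failure_count = 0
        self.opened_at = 0.0

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.state == "open":
            if time.time() - self.opened_at >= self.open_seconds:
                self.state = "half-open"          # timer expired; probe the standard path
            else:
                return fallback()                 # mitigation path while open
        try:
            result = operation()
        except Exception:
            self.failure_count += 1
            if self.state == "half-open" or self.failure_count >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.time()
            return fallback()
        # Success on the standard path closes the breaker and resets the counter.
        self.state = "closed"
        self.failure_count = 0
        return result
```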

Circuit Breaker Considerations

When including the circuit breaker pattern in your architecture, consider the following:

  • A circuit breaker can invoke different mitigations at different times when in the Open state, based on the type of exception thrown from the operation.

  • A circuit breaker should log all transition states. This allows operators to monitor the transition behavior and fine tune the timers to prevent prolonged Open states or frequent oscillations between Open and Half-Open states.

  • Instead of using a timer to transition from Open to Half-open, the circuit breaker could periodically test the standard path to determine if it has become normal again.

  • Be careful about using a circuit breaker when dealing with an operation that leads to multiple shards. The concern here is that health may be at the shard level and can result in two undesirable scenarios:

    • Moving to an Open state when only one shard is failing.

    • Accidentally moving to a Closed state when one or more shards are normal.

Circuit Breaker in Context

A common implementation of this pattern is related to accessing of databases or data services. Once an established type and level of activity fails, the circuit breaker would react. With data, this is typically caused by the inability to connect to a database or a data service in front of that database.

If a call to a database resource failed after 100 consecutive attempts to connect, there is likely little value in continuing to call the database. A circuit breaker could be triggered at that threshold and the appropriate actions can be taken.

In some cases, particularly when connecting to data services, this could be the result of throttling based on a client exceeding the number of allowed calls within a given time period. The circuit breaker may inject delays between calls until such time that connections are successfully established and meet the tolerance levels.

In other cases, the data store itself may be unavailable. If a redundant copy of the data is available, the system may fail over to that replica. If a true replica is unavailable or if the database service is down broadly across all data centers within a provider, a secondary approach may be taken. This could include sourcing a version of the requested data from an alternate data service provider. This alternate source could be a cache, an alternate persistent data store type on the current cloud provider, a separate cloud provider, or an on-premises data center. When such an alternative is not available, the service could also return a recognizable error that can be handled appropriately by the client.

Circuit Breaker Example: Netflix

Netflix, a media streaming company, is often held up as a great example of a resilient architecture. In the Netflix Tech Blog, the team calls out several criteria that are included in their circuit breaker implementation. These include:

  1. A request to the remote service times out.

  2. The thread pool and bounded task queue used to interact with a service dependency are at 100% capacity.

  3. The client library used to interact with a service dependency throws an exception.

All of these contribute to the overall error rate. When that error rate exceeds their defined thresholds, the circuit breaker is “tripped” and the circuit for that service immediately serves fallbacks without even attempting to connect to the remote service.

In that same blog entry, the Netflix team states that the circuit breaker for each of their services implements a fallback using one of the following three approaches:

  1. Custom fallback – a service client library provides an invokable fallback method or locally available data on an API server (e.g., a cookie or local cache) is used to generate a fallback response.

  2. Fail silent – a method returns a null value to the requesting client, which works well when the data being requested is optional.

  3. Fail fast – when data is required or no good fallback is available, a 5xx response is returned to the client. This approach focuses on keeping API servers healthy and enabling a quick recovery when impacted services come back online, but does so at the expense of negatively impacting the client UX.

Handling SLA Outliers: Trusted Parties and Bad Actors

To enforce an SLA, an organization should address how its data service will deal with two categories of outliers—trusted parties and bad actors.

Trusted Parties and White Listing

Trusted parties are organizations with whom the organization could have special arrangements, and for whom certain exceptions to standard SLAs might be made.

  • Third Parties with Custom Agreements

    There may be some users of a service that want to negotiate special pricing terms or policies. In some cases, a high volume of calls to the data service might warrant special pricing. In other cases, demand for a given data service could exceed the volume specified in standard usage tiers. Such customers should be defined as trusted parties to avoid inadvertently being flagged as bad actors.

  • White Listing

    The typical approach to handling trusted parties is to establish a white list. A white list, which identifies a list of trusted parties, is used by the service when it determines which business rules to apply when processing customer usage. White listing is typically done by authorizing either an IP address range or an API key.

    When establishing a consumption policy, an organization should identify if white listing is supported; how a customer would apply to be on the white list; how to add a customer to the white list; and under what circumstances a customer is removed from the white list.

  • Handling Bad Actors

    If trusted parties stand at one end of the customer spectrum, the group at the opposite end is what is referred to as “bad actors.” Bad actors place a burden on the service, typically from attempted “overconsumption.” In some cases bad behavior is genuinely accidental. In other cases it is intentional, and, in a few situations, it is malicious. These actors are labeled “bad”, as their actions – intentional or otherwise – have the ability to impact the availability of one or more services.

    The burden of bad actors can introduce unnecessary costs to the data service provider and compromise access by consumers who faithfully follow the terms of use and have a reasonable expectation of service, as spelled out in an SLA. Bad actors must therefore be dealt with in a prescribed, consistent way. The typical responses to bad actors are throttling and black listing.

  • Throttling

    Organizations should define a strategy for dealing with spikes in usage by data service consumers. Significant bursts of traffic from any consumer can put an unexpected load on the data service. When such spikes occur, the organization might want to throttle access for that consumer for a certain period of time. In this case the service refuses all requests from the consumer for a certain period of time, such as one minute, five minutes, or ten minutes. During this period, service requests from the targeted consumer result in an error message advising that they are being throttled for overuse.

    The consumer making the requests can respond accordingly, such as by altering its behavior.

    The organization should determine whether it wants to implement throttling and set the related business rules. If it determines that consumers can be throttled, the organization will also need to decide what behaviors should trigger the throttling response (a minimal throttling check is sketched after this list).

  • Black listing

    Although throttling should correct the behavior of bad actors, it might not always be successful. In cases in which it does not work, the organization might want to ban a consumer. The opposite of a white list, a black list identifies consumers that are barred from access to the service. The service will respond to access requests from black-listed customers appropriately, and in a fashion that minimizes the use of data service resources.

    Black listing, as with white listing, is typically done by using either an API key or with an IP address range.

    When establishing a consumption policy, the organization should specify what behaviors will place a consumer on the black list; how black listing can be appealed; and how a consumer can be removed from the black list.
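
A minimal sketch of the white-list, black-list, and throttling checks described above follows; the API keys, the one-minute window, and the 100-call limit are hypothetical values.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Set

# A minimal sketch of white listing, black listing, and throttling by API key.
# The one-minute window and 100-call limit are illustrative values only.

WHITE_LIST: Set[str] = {"trusted-partner-key"}
BLACK_LIST: Set[str] = {"banned-key"}
CALLS_PER_MINUTE = 100

_recent_calls: Dict[str, Deque[float]] = defaultdict(deque)

def check_request(api_key: str) -> str:
    """Return 'allow', 'throttle', or 'deny' for a single incoming request."""
    if api_key in BLACK_LIST:
        return "deny"                       # black-listed consumers are rejected outright
    if api_key in WHITE_LIST:
        return "allow"                      # trusted parties bypass the standard limits
    now = time.time()
    window = _recent_calls[api_key]
    while window and now - window[0] > 60:  # drop calls older than the window
        window.popleft()
    if len(window) >= CALLS_PER_MINUTE:
        return "throttle"                   # over the limit: advise the consumer to back off
    window.append(now)
    return "allow"
```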

“Automate All the Things”

People make mistakes. Whether it’s a developer making a code change that could have unexpected consequences, a DBA accidentally dropping a table in a database, or an operations person who makes a change but doesn’t document it, there are multiple opportunities for a person to inadvertently make a service less resilient.

To reduce human error, a logical approach is to reduce the amount of human involvement in the process. Through the introduction of automation, you limit the ability of ad hoc, inadvertent deltas from expected behavior to jeopardize your service.

There is a meme in the DevOps community with a cartoon character saying “Automate All the Things.” In the cloud, most services are exposed with an API. From development tools to virtualized infrastructure to platform services to solutions delivered as Software as a Service, most things are scriptable.

Scripting is highly recommended. Scripting makes deployment and management consistent and predictable and pays significant dividends for the investment.

Automating Deployment

One of the key areas of automation is in the building and deployment of a solution. Automation can make it easy for a developer team to test and deploy to multiple environments. Development, test, staging, beta, and production can all be deployed readily and consistently through automated builds. The ability to deploy consistently across environments works toward ensuring that what’s in production is representative of what’s been tested.

Consider the following as “code”: scripts, configuration files, and meta-data related to deployment. Code also includes the management of these artifacts in source control to:

  • Document changes.

  • Allow versioning.

  • Provide role based access control.

  • Ensure content is backed up.

  • Provide a single source of truth.

Establishing and Automating a Test Harness

Testing is another area that can be automated. Like automated deployment, establishing automated testing is valuable in ensuring that your system is resilient and stays resilient over time. As the code and usage of your service evolve, it’s important to ensure that all appropriate testing continues to be done, both functionally and at scale.

Automating Data Archiving and Purging

One of the areas that gets little attention is that of data archiving and purging. Data is growing at a higher volume and in greater variety than at any time in history. Depending on the database technology and the types of queries required, unnecessary data can reduce the response time of your system and increase costs unnecessarily. For resiliency plans that include one or more replicas of a data store, removing all but the necessary data can expedite management activities such as backing up and restoring data.

Identify the requirements for your solution related to data needed for core functionality, data that is needed for compliance purposes but can be archived, and data that is no longer necessary and can be purged.

Utilize the APIs available from the related products and services to automate the implementation of these requirements.
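
As a hedged illustration only, the following sketch archives and purges rows older than configurable retention windows using a local SQLite database; the table names, column names, and retention periods are hypothetical, and a real job would use the archiving and purging APIs of your actual data platform.

```python
import sqlite3
import time

# A minimal sketch of automated archiving and purging against a local SQLite
# store; a real job would call the archive and purge APIs of your data platform.

RETENTION_DAYS = 90
ARCHIVE_AFTER_DAYS = 30

def archive_and_purge(db_path: str = "orders.db") -> None:
    now = time.time()
    archive_cutoff = now - ARCHIVE_AFTER_DAYS * 86400
    purge_cutoff = now - RETENTION_DAYS * 86400
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT, created REAL)")
        conn.execute("CREATE TABLE IF NOT EXISTS orders_archive (id TEXT, created REAL)")
        # Move rows needed only for compliance into the archive table.
        conn.execute(
            "INSERT INTO orders_archive SELECT * FROM orders WHERE created < ?",
            (archive_cutoff,),
        )
        conn.execute("DELETE FROM orders WHERE created < ?", (archive_cutoff,))
        # Purge archived rows that have aged past the retention window.
        conn.execute("DELETE FROM orders_archive WHERE created < ?", (purge_cutoff,))
        conn.commit()
```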

Understand Fault Domains and Upgrade Domains

When building a resilient architecture, it’s also important to understand the concepts of fault domains and upgrade domains.

Fault Domains

Fault domains constrain the placement of services based on known hardware boundaries and the likelihood that a particular type of outage will affect a set of machines. A fault domain is defined as a series of machines that can fail simultaneously, and is usually defined by physical properties (a particular rack of machines, a series of machines sharing the same power source, etc.).

Upgrade Domains

Upgrade domains are similar to fault domains. Upgrade domains define a physical set of services that are updated by the system at the same time. The load balancer at the cloud provider must be aware of upgrade domains in order to ensure that, if a particular domain is being updated, the overall system remains balanced and services remain available.

Upgrades can occur at:

  • The hypervisor level (“HostOS upgrade”).

  • The operating system level (“Guest OS Upgrade”).

  • The application level, as the result of deploying an application or service upgrade to the environment.

Depending on the cloud provider and platform services utilized, fault domains and upgrade domains may be provided automatically, be something your service can opt-in to via APIs, or require a 1st or 3rd party solution.

Resiliency When a Fault Domain Fails During an Upgrade

While fault domains and upgrade domains are different, there is a scenario where they intersect: a hardware fault takes a virtual machine offline while an upgrade is occurring on another VM instance. In this case, two VMs would be offline. If a service deployment contains only two virtual machines, the service will be offline. Take this into consideration when evaluating the number of instances you need to deliver the SLA you want.

Understanding Affinity Groups

Cloud platforms often provide the ability to identify a level of affinity for a group of resources. We call these resources an “affinity group” or an “affinity set.” This helps the underlying platform with the placement of related resources together and the allocation of instances across Fault and Upgrade domains.

Identify Compute Redundancy Strategies

On-premises solutions have often relied on redundancy to help them with availability and scalability. From an availability standpoint, redundant data centers provided the ability to increase likelihood of business continuity in the face of infrastructure failures in a given data center or part of a data center.

For applications with geo-distributed consumers, traffic management and redundant implementations routed users to local resources, often with reduced latency.

Note:
Data resiliency, which includes redundancy, is covered as a separate topic in the section titled Establishing a Data Resiliency Approach.

Redundancy and the Cloud

On-premises, redundancy has historically been achieved through duplicate sets of hardware, software, and networking. Sometimes this is implemented in a cluster in a single location or distributed across multiple data centers.

When devising a strategy for the cloud, it is important to rationalize the need for redundancy across three vectors. These vectors include deployed code within a cloud provider’s environment, redundancy of providers themselves, and redundancy between the cloud and on premises.

Deployment Redundancy

When an organization has selected a cloud provider, it is important to establish a redundancy strategy for the deployment within the provider.

If deployed to Platform as a Service (PaaS), much of this may be handled by the underlying platform. In an Infrastructure as a Service (IaaS) model, much of this is not.

  • Deploy n roles within a data center

    The simplest form of redundancy is deploying your solution to multiple instances within a single cloud provider. By deploying to multiple instances, the solution can limit downtime that would occur when only a single node is deployed.

    In many Platform as a Service environments, the state of the virtual machine hosting the code is monitored and virtual machines detected to be unhealthy can be automatically replaced with a healthy node.

  • Deploy Across Multiple Data Centers

    While deploying multiple nodes in a single data center will provide benefits, architectures must consider that an entire data center could potentially be unavailable. While not a common occurrence, events such as natural disasters, war, etc. could result in a service disruption in a particular geo-location.

    To achieve your SLA, it may be appropriate for you to deploy your solution to multiple data centers for your selected cloud provider. There are several approaches to achieving this, as identified below.

    1. Fully Redundant Deployments in Multiple Data Centers

      The first option is a fully redundant solution in multiple data centers, done in conjunction with a traffic management provider. A key consideration for this approach is the impact on compute-related costs, which will increase 100% for each additional data center deployment.

    2. Partial Deployment in Secondary Data Center(s) for Failover

      Another approach is to deploy a partial deployment to a secondary data center of reduced size. For example, if the standard configuration utilizes 12 compute instances, the secondary data center would contain a deployment of 6 compute instances.

      This approach, done in conjunction with traffic management, would allow for business continuity with degraded service after an incident that solely impacted the primary center.

      Given the limited number of times a data center goes offline entirely, this is often seen as a cost-effective approach for compute – particularly if a platform allows the organization to readily onboard new instances in the second data center.

    3. Divided Deployments across Multiple Data Centers with Backup Nodes

      For certain workloads, particularly those in the financial services vertical, there is a significant amount of data that must be processed within a short, immovable time window. In these circumstances, work is done in shorter bursts and the costs of redundancy are warranted to deliver results within that window.

      In these cases, code is deployed to multiple data centers. Work is divided and distributed across the nodes for processing. In the instance that a data center becomes unavailable, the work intended for that node is delivered to the backup node which will complete the task.

    4. Multiple Data Center Deployments with Geography Appropriate Sizing per Data Center

      This approach utilizes redundant deployments that exist in multiple data centers but are sized appropriately for the scale of a geo-relevant audience.

Provider Redundancy

While data-center-centric redundancy is good, SLAs are at the Service Level vs. the data center level. Many commercial services today will deploy new versions to a “slice” of production to validate the code in production. However, there is the possibility that the services the provider delivers could become unavailable across multiple or all data centers.

Based on the SLAs for a solution, it may be desirable to also incorporate provider redundancy. To realize this, cloud-deployable products or cloud services that will work across multiple cloud platforms must be identified. Microsoft SQL Server, for example, can be deployed in a Virtual Machine inside of Infrastructure as a Service offerings from most vendors.

For cloud provided services, this is more challenging as there are no standard interfaces in place, even for core services such as compute, storage, queues, etc. If provider redundancy is desired for these services, it is often achievable only through an abstraction layer. An abstraction layer may provide enough functionality for a solution, but it will not be innovated as fast as the underlying services and may inhibit an organization from being able to readily adopt new features delivered by a provider.

If redundant provider services are warranted, this can be at one of several levels: an entire application, a workload, or an aspect of a workload. At the appropriate level, evaluate the need for compute, data, and platform services and determine what must truly be redundant and what can be handled via approaches that provide graceful degradation.

On-Premises Redundancy

While taking a dependency on a cloud provider may make fiscal sense, there may be certain business considerations that require on-premises redundancy for compliance and/or business continuity.

Based on the SLAs for a solution, it may be desirable to also incorporate on-premises redundancy. To realize this, private cloud-deployable products or cloud services that will work across multiple cloud types must be identified. As with the case of provider redundancy, Microsoft SQL Server is a good example of a product that can be deployed on-premises or in an IaaS offering.

For cloud provided services, this is more challenging as there are often no on-premises equivalents with interface and capability symmetry.

If redundant provider services are required on premises, this can be at one of several levels--an entire application, a workload, or an aspect of a workload. At the appropriate level, evaluate the need for compute, data, and platform services and determine what must truly be redundant and what can be handled via approaches to provide graceful degradation.

Redundancy Configuration Approaches

When identifying your redundancy configuration approaches, classifications that existed pre-cloud also apply. Depending on the types of services utilized in your solution, some of this may be handled by the underlying platform automatically. In other cases, this capability is handled through technologies like Windows Fabric.

  1. Active/active — Traffic intended for a failed node is either passed on to an existing node or load balanced across the remaining nodes. This is usually only possible when the nodes utilize a homogeneous software configuration.

  2. Active/passive — Provides a fully redundant instance of each node, which is only brought online when its associated primary node fails. This configuration typically requires the most extra hardware.

  3. N+1 — Provides a single extra node that is brought online to take over the role of the node that has failed. In the case of heterogeneous software configuration on each primary node, the extra node must be universally capable of assuming any of the roles of the primary nodes it is responsible for. This normally refers to clusters which have multiple services running simultaneously; in the single service case, this degenerates to active/passive.

  4. N+M — In cases where a single cluster is managing many services, having only one dedicated failover node may not offer sufficient redundancy. In such cases, more than one (M) standby server is included and available. The number of standby servers is a tradeoff between cost and reliability requirements.

  5. N-to-1 — Allows the failover standby node to become the active one temporarily, until the original node can be restored or brought back online, at which point the services or instances must be failed-back to it in order to restore high availability.

  6. N-to-N — A combination of active/active and N+M, N to N redistributes the services, instances or connections from the failed node among the remaining active nodes, thus eliminating (as with active/active) the need for a 'standby' node, but introducing a need for extra capacity on all active nodes.
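
To make these configurations concrete, the following minimal sketch (not tied to any specific platform API) shows how an N+M configuration might promote a standby node when an active node fails; the node names and pool sizes are hypothetical.

  # Illustrative sketch (not a specific Microsoft API): promoting standby
  # nodes in an N+M configuration when active nodes fail.

  class NodePool:
      def __init__(self, active, standby):
          self.active = set(active)     # N nodes currently serving traffic
          self.standby = list(standby)  # M standby nodes available for promotion

      def handle_failure(self, failed_node):
          """Remove a failed node and promote a standby node, if one remains."""
          self.active.discard(failed_node)
          if self.standby:
              replacement = self.standby.pop(0)
              self.active.add(replacement)
              return replacement
          # No standby capacity left; remaining active nodes absorb the load
          # (effectively degrading to an N-to-N style redistribution).
          return None

  pool = NodePool(active=["node-1", "node-2", "node-3"], standby=["spare-1", "spare-2"])
  print(pool.handle_failure("node-2"))  # promotes and prints "spare-1"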

Traffic Management

Whether traffic is always geo-distributed or routed to different data centers to satisfy business continuity scenarios, traffic management functionality is important to ensure that requests to your solution are being routed to the appropriate instance(s).

It is important to note that taking a dependency on a traffic management service introduces a single point of failure. Investigate the SLA of your application’s primary traffic management service and determine whether alternate traffic management functionality is warranted by your requirements.
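
As one illustration of remediating that dependency, the hedged sketch below probes a primary endpoint's health and falls back to a secondary endpoint; the URLs, probe path, and timeout are hypothetical placeholders rather than a prescribed interface.

  # Illustrative sketch: falling back to a secondary endpoint when the
  # primary (traffic-managed) endpoint fails its health probe. The URLs
  # are hypothetical placeholders.
  import urllib.request

  ENDPOINTS = [
      "https://primary.example.com/health",
      "https://secondary.example.com/health",
  ]

  def select_endpoint(endpoints, timeout_seconds=2):
      """Return the first endpoint whose health probe succeeds, else None."""
      for url in endpoints:
          try:
              with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
                  if response.status == 200:
                      return url
          except Exception:
              continue  # probe failed; try the next endpoint
      return None

  print(select_endpoint(ENDPOINTS))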

Establish a Data Partitioning Strategy

While many high scale cloud applications have done a fine job of partitioning their web tier, they are less successful in scaling their data tier in the cloud. With an ever-growing diversity of connected devices, the volume of data generated and queried is growing at unprecedented levels. The need to support 500,000 new users per day, for example, is now considered reasonable.

Having a partitioning strategy is critically important across multiple dimensions, including how you store, query, and maintain that data.

Decomposition and Partitioning

Because of the benefits and tradeoffs of different technologies, it is common to leverage technologies that are most optimal for the given workload.

Having a solution that is decomposed by workloads provides you with the ability to choose data technologies that are optimal for a given workload. For example, a website may utilize table storage for an individual’s content, utilizing partitions at the user level for a responsive experience. Those table rows may be aggregated periodically into a relational database for reporting and analytics.

Partitioning strategies may, and often will, vary based on the technologies chosen.

When defining your strategy, it is also important to identify outliers that may require a modified or parallel strategy. For example, if you were developing a social networking site, a normal user might have 500 connections, while a celebrity might have 5,000,000.

If the site expects 100M users and celebrities constitute fewer than 50,000 of them, the core partitioning strategy would be optimized for the typical user, and celebrities would be managed differently. While you would group a large number of typical users in a single partition, you might allocate a celebrity to a partition of their own.
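
A minimal sketch of this idea, assuming typical users are hashed into shared partitions while accounts already identified as outliers receive dedicated partitions; the partition count and identifiers are hypothetical.

  # Illustrative sketch: routing typical users to shared hash partitions
  # while giving known high-connection ("celebrity") accounts their own
  # partitions. The partition count and identifiers are hypothetical.
  import hashlib

  SHARED_PARTITIONS = 1024
  CELEBRITY_IDS = {"user-42", "user-77"}  # identified outliers

  def partition_for(user_id):
      if user_id in CELEBRITY_IDS:
          return f"celebrity-{user_id}"          # dedicated partition
      digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
      return f"shared-{int(digest, 16) % SHARED_PARTITIONS}"

  print(partition_for("user-42"))    # celebrity-user-42
  print(partition_for("user-1001"))  # one of the shared partitions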

Understanding the 3 Vs

To properly devise a partitioning strategy, an organization must first understand its data.

The 3 Vs, made popular by Gartner, look at three different aspects of data. Understanding how the 3 Vs relate to your data will assist you in making an informed decision on partitioning strategies.

  • Volume

    Volume refers to the size of the data, and it has very real impacts on the partitioning strategy. The limits of a particular data technology may force partitioning due to size constraints, query speeds at volume, etc.

  • Velocity

    Velocity refers to the rate at which your data is growing. You will likely devise a different partitioning strategy for a slow growing data store vs. one that needs to accommodate 500,000 new users per day.

  • Variety

    Variety refers to the different types of data that are relevant to the workload. Whether it’s relational data, key-value pairs, social media profiles, images, audio files, videos, or other types of data, it’s important to understand it. This is both to choose the right data technology and make informed decisions for your partitioning strategy.

Horizontal Partitioning

Likely the most popular approach to partitioning data is to partition it horizontally. When partitioning horizontally, a decision is made on criteria to partition a data store into multiple shards. Each shard contains the entire schema, with the criteria driving the placement of data into the appropriate shards.

Based on the type of data and the data usage, this can be done in different ways. For example, an organization could choose to partition its data based on a customer’s last name. In another case, the partition could be date-centric, partitioning on the relevant calendar interval of hour, day, week, or month.

Figure 5: An example of horizontal partitioning by last name

Vertical Partitioning

Another approach is vertical partitioning. This optimizes the placement of data in different stores, often tied to the variety of the data. Figure 6 shows an example where metadata about a customer is placed in one store while thumbnails and photos are placed in separate stores.

Vertical partitioning can result in optimized storage and delivery of data. In Figure 6, for example, if the photo is rarely displayed for a customer, returning 3 megabytes per record can add unnecessary costs in a pay-as-you-go model.

Figure 6: An example of vertical partitioning.

Hybrid Partitioning

In many cases it will be appropriate to establish a hybrid partitioning strategy. This approach provides the efficiencies of both approaches in a single solution.

Figure 7 shows an example of this, where the vertical partitioning seen earlier is now augmented to take advantage of horizontal partitioning of the customer metadata.

Figure 7: An example of hybrid partitioning.
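
The following minimal sketch illustrates one possible shape of a hybrid strategy, assuming customer metadata is horizontally partitioned across relational shards by last name while photos are vertically partitioned into a separate blob store; the shard map and store names are hypothetical.

  # Illustrative sketch: hybrid partitioning. Customer metadata is
  # horizontally partitioned by last name across relational shards, while
  # large photo content is vertically partitioned into a separate blob
  # store. The shard map and store names are hypothetical.

  METADATA_SHARDS = {
      ("A", "H"): "customers-shard-1",
      ("I", "P"): "customers-shard-2",
      ("Q", "Z"): "customers-shard-3",
  }

  def metadata_shard_for(last_name):
      first_letter = last_name[:1].upper()
      for (start, end), shard in METADATA_SHARDS.items():
          if start <= first_letter <= end:
              return shard
      raise ValueError(f"No shard covers last name {last_name!r}")

  def placement_for(customer_id, last_name):
      # Vertical split: metadata and photo content live in different stores;
      # only the metadata store is horizontally sharded.
      metadata_location = (metadata_shard_for(last_name), customer_id)
      photo_location = ("photo-blob-store", f"photos/{customer_id}.jpg")
      return metadata_location, photo_location

  # Metadata lands on customers-shard-2; the photo goes to the blob store.
  print(placement_for("c-123", "Mercuri"))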

Networking

At the heart of cloud computing is the network. The network is crucial as it provides the fabric or backbone for devices to connect to services as well as services connecting to other services. There are three network boundaries to consider in any FailSafe application.

Those network boundaries are detailed below with Azure used as an example to provide context:

  1. Role boundaries are traditionally referred to as tiers. Common examples are a web tier or a business logic tier. If we look at Azure as an example, it formally introduced roles as part of its core design to provide infrastructure support for the multi-tier nature of modern, distributed applications. Azure guarantees that role instances belonging to the same service are hosted within the scope of a single network environment and managed by a single fabric controller.

  2. Service boundaries represent dependencies on functionality provided by other services. Common examples are a SQL environment for relational database access and a Service Bus for pub/sub messaging support. Within Azure, for example, service boundaries are enforced through the network: no guarantee will be given that a service dependency will be part of the same network or fabric controller environment. That might happen, but the design assumption for any responsible application has to be that any service dependency is on a different network managed by a different fabric controller.

    Figure 8: Cloud service role instances

  3. Endpoint boundaries are external to the cloud. They include any consuming endpoint, generally assumed to be a device, connecting to the cloud in order to consume services. Special consideration must be given to this part of the design due to the variable and unreliable nature of the network. Role boundaries and service boundaries are within the boundaries of the cloud environment, and one can assume a certain level of reliability and bandwidth. For the external dependencies, no such assumptions can be made, and extra care has to be given to the ability of the device to consume services, meaning data and interactions.

    The network by its very nature introduces latency as it passes information from one point of the network to another. In order to provide a great experience for both users and dependent services or roles, the application architecture and design should look for ways to reduce latency as much as is sensible and to manage unavoidable latency explicitly. One of the most common ways to reduce latency is to avoid service calls that involve the network: local access to data and services is a key approach to reducing latency and increasing responsiveness. Using local data and services also provides another layer of failure protection; as long as the requests of the user or application can be served from the local environment, there is no need to interact with other roles or services, removing the possibility of dependent component unavailability as a failure point.

Caching

Caching is a technique that can be used to improve data access speeds when it’s not possible to store data locally. Caching is utilized to great effect in most cloud services operating at scale today. As the definition provided by Wikipedia outlines, a cache provides local access to data that is repeatedly asked for by applications. Caching relies on two things:

  • Usage patterns for the data by users and dependent applications are predominantly read-only. In certain scenarios such as ecommerce websites, the percentage of read-only access (sometimes referred to as browsing) is up to 95% of all user interactions with the site.

  • The application’s information model provides an additional layer of semantic information that supports the identification of stable, singular data that is optimal for caching.

Device caching

While not the focus of the FailSafe initiative, device caching is one of the most effective ways to increase the usability and robustness of any devices + services application. Numerous ways exist to provide caching services on the device or client tier, ranging from the native caching capabilities of the HTML5 specification, implemented in all standard browsers, to local database instances such as SQL Server Compact Edition or similar.

Distributed caching

Distributed caching is a powerful set of capabilities, but its purpose is not to replace a relational database or other persistent store; rather, its purpose is to increase the responsiveness of distributed applications that by nature are network centric and thus latency sensitive. A side benefit of introducing caching is the reduction of traffic to the persistent data store, which drives the interactions with your data service to a minimum.

  • Information models optimized for caching

    Note
    A lot of the content in this section is based upon the great work by Pat Helland when he was thinking about data in a service-oriented architecture world. One can read the entire article on the Microsoft Developer Network.

    Cached data by its very nature is stale data, i.e. data that is not necessarily up-to-date anymore. A great example of cached data, although from a very different domain, is a product catalog that is sent to thousands of households. The data used to produce the product catalog was up-to-date when the catalog was created. Once the printing presses were going, the data, by the very nature of time passing during the catalog production process, went stale. Because cached data is stale, the attributes of data with respect to stability and singularity are critical to the caching design:

    • Stability - Data that has an unambiguous interpretation across space and time. This often means data values that do not change. For example, most enterprises never recycle customer identifications or SKU numbers. Another technique to create stable data is the addition of expiration dates to the existing data. The printed product catalog example above is a great example. Generally retailers accept orders from any given catalog for 2 periods of publication. If one publishes a catalog four times a year, the product catalog data is stable for 6 months and can be utilized for information processing such as placing and fulfilling orders.

      Stable data is often referenced as master or reference data. In the FailSafe initiative we will utilize the term reference data as it is a semantically more inclusive term than master data. In a lot of enterprises, master data has a very specific meaning and is narrower than reference data.

    • Singularity - Data that can be isolated through association with uniquely identifiable instances with no or low concurrent updates. Take the example of a shopping basket. While the shopping basket will clearly be updated, the updates occur relatively infrequently and can be completely isolated from both a storage and a processing perspective.

      Isolatable data as described above is referenced as activity data or session data.

      With these two attributes in mind, the following schema emerges:

      Figure 9: Data types

  • Information modeling

    A lot of application architects or developers think about object or entity models when they think about information modeling. While object or entity models are part of the art and science of information modeling, a lot more goes into an information model for a FailSafe application. 

    The first view of the thought process required for today's distributed world, described above, focused on stability and singularity. Another key element is understanding how the data is being utilized in the interaction with the user or device, or as part of the to-be-implemented business process(es). Object-oriented modeling ought to be part of the information design in a number of ways:

    • Information hiding - Hiding or not providing access to information that isn't useful for the user or the business process is one of the best ways of avoiding contention on resources or shared data that is stored and managed in a relational database.

      A great example of how information hiding is utilized to great effect to improve concurrency is the difference between a typical ERP system and the way Amazon.com manages most inventory scenarios. In a typical implementation of an ERP system, the inventory table is one of the most congested or hot tables (if one assumes a relational database implementation). The ERP application usually tries to lock the inventory until the user has finished the transaction, providing the exact inventory count to the application at the time the business transaction starts. Some systems--such as SAP--avoid the database lock but acquire an application lock in their enqueue system. A great many techniques in the database layer have been developed to help with this congestion: optimistic concurrency control, dirty reads, etc. None really work cleanly and all have side effects. In certain scenarios you do want to lock the table, but those ought to be few and far between.

      Amazon.com solved the problem in a far simpler way: they offered the user a service-level objective (SLO) that the user could accept or reject. The SLO was mostly expressed as "usually available in N hours"--the user did not see the actual count of available books or other items, but was offered information about availability nonetheless. No database locks were required; in fact, no database connectivity was required at all. If the system wasn't able to meet the SLO, it would send an apology to the buyer (usually via email) and offer an updated SLO.

    • Fungible resources - The dictionary defines fungible as "(especially of goods) being of such a nature or kind as to be freely exchangeable or replaceable, in whole or in part, for another of like nature or kind." The core of the idea, with the goal once again of reducing the friction of accessing shared data, is to use information modeling to provide a way to have the user interact with a fungible instance of data instead of a specific instance. A great example is booking a hotel room. One can express a lot of specificity, such as number of beds, ocean view, smoking/non-smoking, etc., without ever referring to room '123'. Using modeling like that, it is entirely feasible to cache information about pools of fungible data, execute the business process against the pool, and only after finalization of the check-in process assign a specific room out of that pool. Hybrid models of removing a specific room from a pool once the check-in process begins are entirely feasible as well.

  • Managing the cache

    Caching the right information at the right time is a key part of a successful caching strategy. Numerous techniques exist for loading the cache; a good overview is provided in “Caching in the Distributed Environment”. In addition, the sections below outline a few considerations for FailSafe application design that is dependent upon distributed caching.

    • Reference data - If the hosting environment (fabric controller or datacenter) encounters a disaster, your application will be moved to another environment. In the case where another instance of your application is already active (an active-active design), the likelihood that your cache already contains a lot of the relevant information (especially reference data) is high. In the case that a new instance of your application gets spun up, no information will be in the cache nodes. You should design your application so that on a cache miss, it automatically loads the desired data. In the case of a new instance, you can have a startup routine that bulk loads reference data into the cache. A combination of the two is desirable, as users might be active as soon as the application is being served by the cloud infrastructure. A minimal cache-aside sketch follows this list.

    • Activity data - The basic techniques described for reference data hold true for activity data as well. However, there is a specific twist to activity data. Reference data is assumed to be available in any persistent store of the application. As it will change on a less frequent basis, synchronization ought not to be a problem, although it needs to be considered. However, activity data, albeit being updated in isolation and with low frequency, will be more volatile than reference data. Ideally the distributed cache persists activity data on a frequent basis and replicates the data between the various instances of the application. Take care to ensure that the persistence and synchronization intervals are spaced far enough to avoid contention but close enough to keep possible data loss at a minimum.
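
    The following is a minimal cache-aside sketch for reference data, assuming an in-memory dictionary stands in for the distributed cache and a placeholder function stands in for the persistent store; it is illustrative only, not a specific caching API.

      # Minimal cache-aside sketch for reference data. The dictionary stands
      # in for a distributed cache and load_from_store() stands in for the
      # persistent store; both are hypothetical placeholders.

      cache = {}

      def load_from_store(key):
          # Placeholder for a read against the persistent data store.
          return {"sku": key, "description": f"Product {key}"}

      def get_reference_data(key):
          """Return cached data, loading it from the store on a cache miss."""
          value = cache.get(key)
          if value is None:
              value = load_from_store(key)  # cache miss: fall back to the store
              cache[key] = value            # populate the cache for later reads
          return value

      def warm_cache(keys):
          """Bulk-load reference data at startup for a newly provisioned instance."""
          for key in keys:
              cache[key] = load_from_store(key)

      warm_cache(["sku-1", "sku-2"])
      print(get_reference_data("sku-3"))  # miss, then cached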

Establishing a Data Resiliency Approach

A common misunderstanding is the relationship, specifically the areas of responsibility, between platform and application. One of the areas where this is most troublesome is in respect to data.

While a platform such as Azure will deliver on promises of storing multiple copies of the data (and in some services even going so far as to provide geo-redundancy), the data that is stored is driven by the application, workload, and its component services. If the application takes an action that corrupts its application data, the platform will faithfully store multiple copies of that corrupted data.

When establishing your failure modes and failure points, it’s important to identify areas of the application that could potentially cause data corruption. While the point of origin could vary, from bad code to poison messages sent to your service, it’s important to identify the related failure modes and failure points.

Application Level Remediation

  • Idempotency

    A core assumption with connected services is that they will not be 100% available and that transient fault handling with retry logic is a core implementation approach. In cases where retry logic is implemented, there is the potential for the same message to be sent more than once, for messages to be sent out of sequence, etc.

    Operations should be designed to be idempotent, ensuring that sending the same message multiple times does not result in an unexpected or polluted data store.

    For example, inserting data from all requests may result in multiple records being added if the service operation is called multiple times. An alternate approach would be to implement the code as an intelligent “upsert” which performs an update if a record exists or an insert if it does not. A timestamp or global identifier could be used to identify new vs. previously processed messages, inserting only newer ones into the database and updating existing records if the message is newer than what was received in the past.
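
    A minimal sketch of such an idempotent upsert, using sqlite3 as a stand-in data store and a hypothetical orders table keyed by a message identifier with a timestamp check:

      # Illustrative idempotent "upsert" sketch using sqlite3 as a stand-in
      # data store. The order identifier plus timestamp ensures that replayed
      # or out-of-order deliveries do not pollute the data.
      import sqlite3

      conn = sqlite3.connect(":memory:")
      conn.execute("""CREATE TABLE orders (
          order_id TEXT PRIMARY KEY,
          status TEXT,
          updated_at INTEGER)""")

      def apply_message(order_id, status, message_timestamp):
          row = conn.execute(
              "SELECT updated_at FROM orders WHERE order_id = ?", (order_id,)
          ).fetchone()
          if row is None:
              conn.execute(
                  "INSERT INTO orders (order_id, status, updated_at) VALUES (?, ?, ?)",
                  (order_id, status, message_timestamp))
          elif message_timestamp > row[0]:
              conn.execute(
                  "UPDATE orders SET status = ?, updated_at = ? WHERE order_id = ?",
                  (status, message_timestamp, order_id))
          # Otherwise the message is a duplicate or older than what was already
          # processed, so it is ignored.
          conn.commit()

      apply_message("o-1", "created", 100)
      apply_message("o-1", "created", 100)   # duplicate delivery: no effect
      apply_message("o-1", "shipped", 200)   # newer message: updates the record
      print(conn.execute("SELECT * FROM orders").fetchall())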

  • Workload Activities and Compensating Behavior

    In addition to idempotency, another area for consideration is the concept of compensating behavior.

    A real world example of compensating behavior is seen when returning a product that was paid for with a credit card. In this scenario, a consumer visits a retailer, provides a credit card, and a charge is applied to the consumer’s credit card account. If the consumer returns the product to the retailer, a policy is evaluated, and if the return conforms to the policy, the retailer issues a credit for the amount of the purchase to the consumer’s credit card account.

    In a world of an ever-growing set of connected systems and the emergence of composite services, understanding how to handle compensating behavior becomes increasingly important.

    For many developers of line-of-business applications, the concepts of transactions are not new, but the frame of reference is often tied to the transactional functionality exposed by local data technologies and related code libraries. When looking at the concept in terms of the cloud, this mindset needs to take into account new considerations related to orchestration of distributed services.

    A service orchestration can span multiple distributed systems and be long running and stateful. The orchestration itself is rarely synchronous and it can span from seconds to years based on the business scenario.

    A supply chain scenario, for example, could tie together 25 organizations in the same workload activity, with a set of 25 or more systems interconnected in one or more service orchestrations.

    If the activity succeeds, the 25 systems must be made aware that it was successful. For each connection point in the activity, participant systems can provide a correlation ID for the messages they receive from other systems. Depending on the type of activity, the receipt of that correlation ID may satisfy the party that the transaction is notionally complete. In other cases, upon the completion of the interactions of all 25 parties, a confirmation message may be sent to all parties (either directly from a single service or via the specific orchestration interaction points for each system).

    To handle failures in composite and/or distributed activities, each service would expose a service interface and operation(s) to receive requests to cancel a given transaction by a unique identifier. Behind the service façade, workflows would be in place to compensate for the cancellation of this activity. Ideally these would be automated procedures, but they can be as simple as routing to a person in the organization to remediate manually.
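
    A minimal sketch of such a cancellation interface, assuming a hypothetical in-process registry of compensating actions keyed by activity type; a real implementation would sit behind a service façade and could route to automated workflows or to people.

      # Illustrative sketch of a compensation interface behind a service
      # façade: a cancellation request, identified by a correlation ID, is
      # routed to the compensating action registered for that activity type.
      # The handlers and registry are hypothetical.

      compensation_handlers = {}

      def register_compensation(activity_type, handler):
          compensation_handlers[activity_type] = handler

      def cancel_activity(correlation_id, activity_type):
          """Entry point a partner system would call to request compensation."""
          handler = compensation_handlers.get(activity_type)
          if handler is None:
              # No automated workflow exists; route to a person for manual remediation.
              return f"ticket opened for manual remediation of {correlation_id}"
          return handler(correlation_id)

      def refund_payment(correlation_id):
          return f"refund issued for {correlation_id}"

      register_compensation("payment", refund_payment)
      print(cancel_activity("corr-123", "payment"))
      print(cancel_activity("corr-456", "inventory-reservation"))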

Backups

In addition to application-level remediation to avoid data corruption, there is also remediation that is put in place to provide options if application remediation is not successful.

Processes for both creating and restoring backup copies of your data store – either in whole or in part – should be part of your resiliency plan. While the concepts of backing up and restoring data are not new, there are new twists to this in the cloud.

Your backup strategy should be defined with a conscious understanding of the business requirements for restoring data. If a data store is corrupted or taken offline due to a disaster scenario, you need to know what type of data must be restored, what volume must be restored, and what pace is required for the business. This will impact your overall availability plan and should drive your backup and restore planning.

  • Relational Databases

    Backing up relational databases is nothing new. Many organizations have tools, approaches, and processes in place for backing up data to satisfy either disaster recovery or compliance needs. In many cases traditional backup tools, approaches, and processes may work with little or no modification. In addition, there are new or variant alternatives, such as backing up data and storing a copy in cloud-based blob storage, that can be considered.

    When evaluating existing processes and tools, it’s important to evaluate which approach is appropriate for the cloud based solution. In many cases, one or more of the approaches listed below will be applied to remediate different failure modes.

    1. Total Backup - This is a backup of a data store in its entirety. Total backups should occur based on a schedule dictated by your data volume and velocity. A total backup is the complete data set needed to deliver on the service level agreement for your service. Mechanisms for this type of backup are generally available either by the database / database service provider or its vendor ecosystem.

    2. Point in Time - A point in time backup is a backup that reflects a given point in time in the database’s existence. If an error were to occur in the afternoon that corrupted the data store, for example, a point in time backup taken at noon could be restored to minimize business impact.

      Given the ever-growing level of connectivity of individuals, the expectation to engage with your service at any time of day makes the ability to quickly restore to a recent point in time a necessity.

    3. Synchronization - In addition to traditional backups, another option is synchronization of data. Data could be stored in multiple data centers, for example, with a periodic synchronization of data from one datacenter to another. In addition to providing synchronized data in solutions that utilize traffic management as part of a normal availability plan, this can also be used to fail over to a second data center if there is a business continuity issue.

      Given the constant connectivity of individuals consuming services, downtime becomes less and less acceptable for a number of scenarios and synchronization can be a desirable approach.

      Patterns for synchronization can include:

      - data center to data center of a given cloud provider

      - data center to data center across cloud providers

      - data center to data center from on-premises to a given cloud provider

      - data center to device synchronization for consumer specific data slices
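
    Whichever of these approaches is applied, automation helps keep the schedule honest. As one hedged example of storing a backup copy in cloud-based blob storage (mentioned above), the sketch below uploads a timestamped backup file to a container in a second region using the azure-storage-blob Python package; the connection string, container name, and file path are hypothetical placeholders, and the exact SDK surface may differ by version.

      # A minimal point-in-time backup copy sketch, assuming the
      # azure-storage-blob Python package and a pre-created container named
      # "db-backups" in a storage account in a second region. The connection
      # string and file path are hypothetical placeholders.
      from datetime import datetime, timezone
      from azure.storage.blob import BlobServiceClient

      CONNECTION_STRING = "<secondary-region-storage-connection-string>"
      LOCAL_BACKUP_FILE = "nightly-backup.bak"

      def copy_backup_offsite():
          service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
          container = service.get_container_client("db-backups")
          # Name the blob by timestamp so point-in-time copies are retained.
          blob_name = datetime.now(timezone.utc).strftime("backup-%Y%m%dT%H%M%SZ.bak")
          with open(LOCAL_BACKUP_FILE, "rb") as backup_file:
              container.upload_blob(name=blob_name, data=backup_file, overwrite=False)
          return blob_name

      # copy_backup_offsite()  # run on the schedule dictated by data volume and velocity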

  • Sharded Relational Databases

    For many, the move to the cloud is driven by a need to facilitate large numbers of users and high traffic scenarios such as those related to mobile or social applications. In these scenarios, the application pattern often involves moving away from a single database model to a number of database shards that contain a portion of the overall data set and are optimized for large scale engagement. One recent social networking project built on Azure launched with a total of 400 database shards.

    Each shard is a standalone database and your architecture and management should facilitate total backups, point in time backups, and restoration of backups for both individual shards and a complete data set including all shards.

  • NoSQL Data Stores

    In addition to relational data stores, backup policies should be considered for “Not only SQL” or NoSQL data stores as well. The most popular form of NoSQL databases provided by major cloud providers would be a form of high availability key-value pair store, often referred to as a table store.

    NoSQL stores may be highly available. In some cases they will also be geo-redundant, which can help prevent loss in the case of a catastrophic failure in a specific data center. These stores typically do not provide protection from applications overwriting or deleting content unintentionally. Application or user errors are not handled automatically by the platform, so a backup strategy should be evaluated.

    While relational databases typically have existing and well-established tools for performing backups, many NoSQL stores do not. A popular architectural approach is to create a duplicate copy of the data in a replica NoSQL store and use a lookup table of some kind to identify which rows from the source store have been placed in the replica store. To restore data, this same table would be utilized, reading from the table to identify content in the replica store available to be restored.

    Depending on the business continuity concerns, the placement of this replica could be hosted with the same cloud provider, in the same data center, and/or the same NoSQL data store. It could also reside in a different data center, with a different cloud provider, and/or in a different variant of NoSQL data store. The placement will be largely influenced by the desired SLA of your workload service and any related regulatory compliance considerations.

    A factor to consider when making this determination is cost, specifically as it relates to data ingress and egress. Cloud providers may provide free movement of data within their data center(s) and allow free passage of data into their environment. No cloud provider offers free data egress, and the cost of moving data to a secondary cloud platform provider could introduce significant costs at scale.

    Note
    The lookup table becomes a potential failure point and a replica of it should be considered as part of resiliency planning.
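
    A minimal sketch of the replica-plus-lookup-table approach, with in-memory dictionaries standing in for the source store, the replica store, and the lookup table:

      # Illustrative sketch of the replica-plus-lookup-table approach. The
      # dictionaries stand in for the source store, the replica store, and
      # the lookup table; they are hypothetical placeholders.

      source_store = {}
      replica_store = {}
      lookup_table = {}   # maps source key -> key of the backed-up copy in the replica

      def backup_row(key):
          """Copy a row to the replica store and record it in the lookup table."""
          replica_key = f"replica::{key}"
          replica_store[replica_key] = dict(source_store[key])
          lookup_table[key] = replica_key

      def restore_row(key):
          """Use the lookup table to find and restore a row from the replica."""
          replica_key = lookup_table[key]
          source_store[key] = dict(replica_store[replica_key])

      source_store["user-1"] = {"name": "Ada", "points": 10}
      backup_row("user-1")
      source_store["user-1"]["points"] = -999   # simulated application error
      restore_row("user-1")
      print(source_store["user-1"])             # {'name': 'Ada', 'points': 10}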

  • Blob Storage

    As with relational and NoSQL data stores, a common misconception is that the availability features implemented for a blob storage offering will remove the need to consider implementing a backup policy.

    Blob storage also may be geo-redundant, but, as discussed earlier, this does not guard against application errors. Application or user errors are not handled automatically by platform services such as blob storage and a backup strategy should be evaluated.

    Backup strategies could be very similar to those used for NoSQL stores. Due to the potentially large size of blobs, cost and time to move data will be important parts of a backup and restore strategy.

  • Restoring Backups

    By now, most people have heard the cautionary tale of the organization that established and diligently followed backup policies but never tested restoring the data. On that fateful day when a disaster did occur, they went to restore their database backup only to discover they had configured their backup incorrectly and the tapes they’d been sending offsite for years didn’t have the information they needed on them.

    Whatever backup processes are put into place, it’s important to establish testing to verify that data can be restored correctly and to ensure that restoration occurs in a timely fashion and with minimal business impact.

Content Delivery Networks

Content Delivery Networks (CDNs) are a popular way to provide availability and enhanced user experience for frequently requested files. Content in a CDN is copied to a local node on its first use, and then served up from that local node for subsequent requests. The content will expire after a designated time period, after which content must be re-copied to the local node upon the next request.

Utilizing a CDN provides a number of benefits but it also adds a dependency. As is the case with any dependency, remediation of a service failure should be proactively reviewed.

Appropriate Use of a CDN

A common misconception is that CDNs are a cure-all for scale. In one scenario, a customer was confident a CDN was the right solution for an online ebook store. It was not. Why? In a catalog of a million books, there is a small subset of books that are frequently requested (the “hits”) and a very long tail of books that are requested with very little predictability. Frequently requested titles would be copied to the local node on the first request and provide cost-effective local scale and a pleasant user experience. For the long tail, almost every request is copied twice – once to the local node, then to the customer – as the infrequent requests result in content regularly expiring. This is evidence that using a CDN improperly will have the opposite of the intended effect – a slower, more costly solution.

Design for Operations

In many cases, operations of a solution may not be planned until further along in the lifecycle. To build truly resilient applications, they should be designed for operations. Designing for operations typically will include key activities such as establishing a health model, capturing telemetry information, incorporating health monitoring services and workflows, and making this data actionable by both operations and developers.

Establish a Health Model

Development teams often overlook and sometimes completely ignore application “health.” As a result, services often go into production with two known states: up or down. Designers of resilient services should develop health models that define application health criteria, diminished health state, failure, and health dependencies.

Proactively developing a health model outlines failure modes and points, requiring that developers identify and study what-if scenarios in application design phases. To operationalize a health model, a service must be able to communicate its health. It must have a programmatic way to broadcast such information, provide an interface for that health status to be queried interactively, provide mechanisms (or hooks into existing mechanisms) through which admins can monitor application health in real-time, and establish mechanisms through which admins can—when necessary—deliver corrective “medicine” to return the application to a healthy state.

Define Characteristics

For a given service or application, diagnostics should be taken to identify data points and value ranges that indicate at least three categories of health – healthy, diminished health, and unhealthy. This information can be utilized to automate self-healing of services.
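
As a hedged illustration, the sketch below maps a few diagnostic data points to the three health categories; the metrics and thresholds are hypothetical examples, not recommended values.

  # Illustrative sketch: classifying service health from diagnostic data
  # points. The metrics and thresholds are hypothetical examples.

  def classify_health(cpu_percent, error_rate_per_minute, queue_depth):
      if cpu_percent > 95 or error_rate_per_minute > 100 or queue_depth > 10_000:
          return "unhealthy"
      if cpu_percent > 80 or error_rate_per_minute > 10 or queue_depth > 1_000:
          return "diminished"
      return "healthy"

  print(classify_health(cpu_percent=55, error_rate_per_minute=2, queue_depth=120))   # healthy
  print(classify_health(cpu_percent=88, error_rate_per_minute=4, queue_depth=200))   # diminished
  print(classify_health(cpu_percent=97, error_rate_per_minute=250, queue_depth=50))  # unhealthy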

Define Interfaces

With health states defined, a service should expose health related interfaces. These interfaces will provide the following categories of operations.

  • Emit health status to subscribed partner services

    A service should emit health information to subscribed partner services. This health status should be concise, with high level health and basic diagnostics.

    The service should provide the ability for a partner service to subscribe to health messages. The delivery of health messages can occur through paths appropriate to the solution. One path could have the service place health status updates on a queue from which subscribers retrieve them.

    An alternate approach is to allow subscribers to provide an endpoint at which a known health reporting interface is exposed. When health information changes, the service can emit information to subscribers at their provided endpoints.
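
    A minimal sketch of this publication pattern, with in-process Python queues standing in for a real messaging service and a hypothetical message shape:

      # Illustrative sketch: publishing concise health status messages to
      # subscribers. Python queues stand in for a real messaging service, and
      # the message shape is a hypothetical example.
      import json
      import queue
      import time

      subscribers = {}   # subscriber name -> queue the subscriber drains

      def subscribe(name):
          subscribers[name] = queue.Queue()
          return subscribers[name]

      def emit_health_status(status, diagnostics):
          message = json.dumps({
              "service": "orders-service",
              "status": status,              # healthy | diminished | unhealthy
              "diagnostics": diagnostics,    # high-level only; details on request
              "timestamp": time.time(),
          })
          for q in subscribers.values():
              q.put(message)

      partner_queue = subscribe("billing-service")
      emit_health_status("diminished", {"error_rate_per_minute": 14})
      print(partner_queue.get())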

  • Expose interfaces for delivering telemetry data

    Telemetry is great for operations as it helps identify the current health of the application and/or services on multiple levels. It also can help quickly identify when something atypical is happening in the environment. This provides a fairly granular view of the service and is exposed to operations staff, services, and dashboards of the service provider.

    For operations, telemetry metrics can include, for example, average CPU percentage, Errors, Connections, and Queues for different roles, services, and composite applications built on services.

    Application specific telemetry is typically not automatically enabled and monitored directly by the platform itself, so telemetry should be enabled by the service and application.

  • Expose interfaces to interrogate service for additional health diagnostic information

    A service should ideally also provide interfaces to interrogate additional health diagnostic information. Based on the high level information received in the health status, subscriber services may request additional information to determine how to proceed in their relationship with the service providing its health details.

    Specifically, if the service appears to be in a poor state of health, additional information could allow a consuming service to make a decision to continue to use the service or fail over to an alternate provider.

  • Expose interfaces to remediate service health at the service and application level

    If the consumer of health information is an internal or related service, you may want to enable the service to self-remediate known issues. Just as with human medicine, service health understanding will evolve over time and health data can lead to different courses of treatment.

    A service should expose an interface to allow delivery of that treatment. In its simplest form, this could range from triggering a reboot or reimaging of a service to failing over to a different internal data or service provider.

    Understanding the health of a service can enable the service provider to identify quickly if a service is unhealthy. Automated operations can provide expedited, consistent, and prescriptive remediation of known issues and make service self-healing. This section reviews the different aspects of health in more detail.

Telemetry

Telemetry is “the process of using special equipment to take measurements of something (such as pressure, speed, or temperature) and send them by radio to another place.”

It is important that you collect telemetry data from your service. Telemetry is often categorized into one of four types: User, Business, Application, and Infrastructure.

User telemetry captures the behavior of an individual, targeted user. This provides insight about the behavior of a user and can facilitate the delivery of an individualized experience as a result.

Business telemetry contains data related to business activities and key performance indicators (KPIs) for the business. Examples of business telemetry include the number of unique active users on a site, the number of videos watched, etc.

Application telemetry contains detail on the health, performance, and activity of the application layer and its dependent services. Application telemetry would contain details, for example, on data client calls and logins, exceptions, API calls, sessions, etc.

Infrastructure telemetry includes details on the health of the underlying system infrastructure. This may include data on resources such as CPU, memory, network, available capacity, etc.

Telemetry Considerations

When identifying what data to collect and how to collect it, it is important to understand the data and what you intend to do with it.

First, determine if the purpose of the data being collected is to inform or initiate an action. The question to ask is, “how quickly should I react?” Will you use the data near real time to potentially initiate an action? Alternatively, will you use the data in a month over month trending report? The answer to these questions will inform the telemetry approach and technologies used in the architecture.

Next, identify the type of question(s) you intend to apply to the telemetry data you are collecting. Will you use the telemetry to answer known questions, or will you use it for exploration? For example, for a business, KPIs are answers to known questions. However, a manufacturer who wants to explore device data for patterns that result in faults would be venturing into the unknown. For the manufacturer, the faults are derived from one or more items in the system. The manufacturer is doing exploration and would require additional data.

When you use telemetry to gain insight, you must consider the time required to gain that insight. In some cases, you will leverage telemetry to detect a spike in a device sensor reading that has a window of seconds or minutes. In other cases, you may use telemetry to identify week over week user growth for a website that has a much longer window.

Finally, consider the amount of data you can gather from a signal source within the timeframe to gain insight. You must know the amount of the source signal you need. You can then determine the best way to partition that signal and establish the appropriate mix of local and global computation.

Another consideration is how to record the sequence of events in your telemetry. Many organizations will default to time stamps. Time stamps, however, can be a challenge because server clocks in and across data centers are inconsistent. While time may be synchronized periodically, there is documented evidence that server clocks drift (slowly become more inaccurate). This drift results in changes that may impair effective analysis. For scenarios that require precision, consider alternate solutions, such as leveraging a vector clock implementation.
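
As a hedged illustration of the vector clock alternative, the minimal sketch below tracks the causal ordering of telemetry events across nodes whose wall clocks may drift; the node names are hypothetical.

  # Minimal vector clock sketch for ordering telemetry events when server
  # clocks drift. Node names are hypothetical.

  def increment(clock, node):
      """Advance a node's own counter before it records or sends an event."""
      updated = dict(clock)
      updated[node] = updated.get(node, 0) + 1
      return updated

  def merge(local, received, node):
      """On receiving an event, take the element-wise maximum, then tick."""
      merged = {k: max(local.get(k, 0), received.get(k, 0))
                for k in set(local) | set(received)}
      return increment(merged, node)

  def happened_before(a, b):
      """True if the event with clock a causally precedes the event with clock b."""
      keys = set(a) | set(b)
      return all(a.get(k, 0) <= b.get(k, 0) for k in keys) and a != b

  clock_a = increment({}, "node-a")            # {'node-a': 1}
  clock_b = merge({}, clock_a, "node-b")       # {'node-a': 1, 'node-b': 1}
  print(happened_before(clock_a, clock_b))     # True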

Signal Characterization

After you identify a telemetry approach, move forward to characterizing the signal.

You can classify the signal as continuous or discrete. An example of a continuous signal is real time event data. Log file entries are part of a discrete signal.

To satisfy the needs of your approach, you must identify how much data is in the signal. If the information is continuous, you must establish a sampling rate. If it is discrete, you must identify the records to keep or filter.

You must also identify the access type. In some cases, it will be appropriate to push the telemetry data. In other cases, you will retrieve telemetry data by pulling.

In more evolved systems, insight gained from pushed telemetry may trigger targeted pulls for additional information. For example, if pushed telemetry indicates a suboptimal state, you can retrieve additional diagnostic data by pulling.

Telemetry Policy Considerations

All the collected data may be relevant under certain conditions, but not all telemetry data must be sent all the time. Based on the telemetry type and the amount of data required to derive insight, you may be able to retrieve smaller amounts of data. You can use local computation to generate aggregates, samplings, or subsets, and send that data to the service. If the data you receive from standard telemetry indicates a suboptimal state, for example, the service can request additional data.

When developing a telemetry strategy, consider what policies are appropriate. Telemetry requires available connectivity, and there is a cost to send data over that connection. Create policies that consider context, connectivity, and cost, and adjust the telemetry as appropriate.

Your policies must consider the context of the current state to determine what telemetry is appropriate to send right now. The insight gained from previously received telemetry and the associated business needs informs the context. This context helps direct and prioritize telemetry collection.

Devices may have varying types of connectivity (WiFi, LTE, 4G, 3G, etc.), and that connectivity may be variable. The connectivity of the device right now is relevant to determining which data should be transmitted. In scenarios where connectivity is unreliable or connectivity speeds are low, policy can prioritize the transmitted telemetry.

Connectivity is often provided at a cost. Policies will consider whether a connection is currently a metered or non-metered one. If it is a metered connection, the policies will identify the associated costs and, if available, usage caps. Many devices utilize different types of connections over the course of a day. You may prioritize or de-prioritize specific telemetry based on the cost to deliver it.
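
A minimal sketch of such a policy, assuming local aggregation of readings with full detail sent only when the connection is unmetered or the aggregate indicates a suboptimal state; the thresholds and connection model are hypothetical.

  # Illustrative sketch of a telemetry policy: aggregate readings locally
  # and send full detail only when the connection is unmetered or the
  # aggregate indicates a suboptimal state. The thresholds and connection
  # model are hypothetical.

  def summarize(readings):
      return {
          "count": len(readings),
          "average": sum(readings) / len(readings),
          "maximum": max(readings),
      }

  def choose_payload(readings, connection_metered, suboptimal_threshold=90):
      summary = summarize(readings)
      suboptimal = summary["maximum"] >= suboptimal_threshold
      if suboptimal or not connection_metered:
          # Send full detail when it is cheap to do so or clearly needed.
          return {"summary": summary, "detail": readings}
      # On a metered connection in a healthy state, send only the aggregate.
      return {"summary": summary}

  cpu_samples = [42, 55, 61, 48, 95]
  print(choose_payload(cpu_samples, connection_metered=True))   # detail sent: spike at 95
  print(choose_payload([40, 41, 39], connection_metered=True))  # summary only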

Visualizing Telemetry

Different audiences will often ingest and visualize telemetry data. Operations and Developers are two audiences for which visualization is important. However, as outlined below, the needs of each audience require different levels of granularity.

Visualize Telemetry for Operations

Visualizing high-level operation status and lower-level telemetry data is important for the Operations staff. Automated notifications will likely be in place based on telemetry data. However, Operations will want a dashboard that helps visualize current and historical service performance.

For applications of significant scale, this information can help identify a current issue quickly or predict a future issue. It can also help Operations identify the potential impacts and root causes.

Figure 10: Operations performance dashboard

The above diagram from a large social site looks at API Role Average CPU%, API Errors, Users Online, Web Active Connections, Web Role CPU%, Web Errors, Web Hard Connections, Web Pooled Connections, Web App Queue, Web Soft Connections.

Telemetry and the type of reporting shown above are particularly helpful in cases where operations can remediate the errors without code modifications to the services themselves. Examples of activities that operations can perform can include deploying more roles and recycling instances.

You can leverage the visualization of historical and real-time telemetry data for:

  • Troubleshooting.

  • Post-mortems done for live site issues.

  • Training new Operations staff.

Interpret Telemetry for Developers

In addition to operations staff, software developers are an important consumer of telemetry data. Errors may not be tied to the operating environment and instead may be tied to underlying application code. Having a telemetry dashboard for developers allows them to take directed action.

Below are two screenshots of an example utility that was created for just such a purpose. The dashboard provides details on the number of errors, what categories the errors fall into, and specific data related to each of those categories, including the total number of errors, the total number of role instances experiencing errors, and the total number of databases experiencing errors.

Figure 11: Developer dashboard

Figure 12: Developer dashboard

For large sites with millions of engaged users, higher error counts will happen and may be perfectly acceptable. Having a developer-centric dashboard that interprets telemetry can help identify whether there truly is a problem, whether it’s appropriate to engage, and where in the code the issues are occurring.
