Best Practices for Large-Scale Azure Cloud Services

Best Practices for the Design of Large-Scale Services on Azure Cloud Services

Updated: October 14, 2014

Authors: Mark Simms and Michael Thomassy

Contributing authors: Jason Roth and Ralph Squillace

Reviewers: Brad Calder, Dennis Mulder, Mark Ozur, Nina Sarawgi, Marc Mercuri, Conor Cunningham, Peter Carlin, Stuart Ozer, Lara Rubbelke, and Nicholas Dritsas.

Cloud computing is distributed computing; distributed computing requires thoughtful planning and delivery – regardless of the platform choice. The purpose of this document is to provide thoughtful guidance based on real-world customer scenarios for building scalable applications on Azure and SQL Database, leveraging the Platform-as-a-Service (PaaS) approach (such applications are built as Azure Cloud Services, using web and worker roles).

NOTE: All of the best practice guidance in this paper is derived from deep engagement with customers that are running production code on Azure. This paper discusses the Azure Cloud Services (PaaS) platform based on the v1.6 SDK release; it does not cover features such as Azure Web Sites or Azure Virtual Machines (IaaS).

This document covers the underlying design concepts for building Azure applications, key Azure platform capabilities, limits and features, as well as best practices for working with the core Azure services. The focus is on those applications amenable to a loosely-consistent distributed data store (as opposed to strictly consistent or high-density multi-tenant data models).

Shifting aspects of your applications and services into Azure can be attractive for many reasons, such as:

  • Conserving or consolidating capital expenditures into operational expenditures (capex into opex).

  • Reducing costs (and improving efficiency) by more closely matching demand and capacity.

  • Improving agility and time to market by reducing or removing infrastructure barriers.

  • Increasing audience reach to new markets such as mobile devices.

  • Benefiting from the massive scale of cloud computing by building new applications that can support a global audience in geo-distributed datacenters.

There are many excellent technical reasons to develop new applications or port some or all of an existing application to Azure. As the environment is rich with implementation alternatives, you must carefully evaluate your specific application pattern in order to select the correct implementation approach. Some applications are a good fit for Azure Cloud Services (which is a Platform-as-a-Service, or PaaS approach), whilst others might benefit from a partial or complete infrastructure-as-a-service (or IaaS) approach, such as Azure Virtual Machines. Finally, certain application requirements might be best served by using both together.

Your application should have one or more of the following three key aspects in order to maximize the benefits of Azure Cloud Services. (Not all of these need to be present in your application; an application may well yield a strong return on your investment by making very good use of Azure with only one of the following aspects. However, a workload that does not exhibit any of these characteristics is probably not an ideal fit for Azure Cloud Services.)

The important aspects against which an application should be evaluated are:

  • Elastic demand. One of the key value propositions of moving to Azure is elastic scale: the ability to add or remove capacity from the application (scaling-out and scaling-in) to more closely match dynamic user demand. If your workload has a static, steady demand (for example, a static number of users, transactions, and so on) this advantage of Azure Cloud Services is not maximized.

  • Distributed users and devices. Running on Azure gives you instant access to global deployment of applications. If your workload has a captive user base running in a single location (such as a single office), cloud deployment may not provide optimal return on investment.

  • Partitionable workload. Cloud applications scale by scaling out – and thereby adding more capacity in smaller chunks. If your application depends upon scaling up (for example, large databases and data warehouses) or is a specialized, dedicated workload (for example, large, unified high-speed storage), it must be decomposed (partitioned) to run on scale-out services to be feasible in the cloud. Depending on the workload, this can be a non-trivial exercise.

To reiterate: when evaluating your application, you may well achieve a high return on the investment of moving to or building on Azure Cloud Services even if your workload exhibits only one of the preceding three aspects. Applications that exhibit all three characteristics are likely to see the strongest return on the investment.

While many aspects of designing applications for Azure are very familiar from on-premises development, there are several key differences in how the underlying platform and services behave. Understanding these differences – and as a result how to design for the platform, not against it – is crucial in delivering applications that redeem the promise of elastic scale in the cloud.

This section outlines five key concepts that are the critical design points of building large-scale, widely distributed scale-out applications for Platform-as-a-Service (PaaS) environments like Azure Cloud Services. Understanding these concepts will help you design and build applications that do not just work on Azure Cloud Services, but thrive there, returning as many of the benefits of your investment as possible. All of the design considerations and choices discussed later in this document will tie back to one of these five concepts.

In addition, it’s worth noting that while many of these considerations and best practices are viewed through the lens of a .NET application, the underlying concepts and approaches are largely language or platform agnostic.

The primary shift in moving from an on-premises application to Azure Cloud Services is related to how applications scale. The traditional method of building larger applications relies on a mix of scale-out (stateless web and application servers) and scale-up (buy a bigger multi-core/large memory system, database server, build a bigger data center, and so on). In the cloud scale-up is not a realistic option; the only path to achieving truly scalable applications is by explicit design for scale-out.

As many of the elements of an on-premise application are already amenable to scale-out (web servers, application servers), the challenge lies in identifying those aspects of the application which take a dependency on a scale-up service and converting (or mapping) that to a scale-out implementation. The primary candidate for a scale-up dependency is typically the relational database (SQL Server / Azure SQL Database).

Traditional relational designs have focused around a globally coherent data model (single-server scale-up) with strong consistency and transactional behavior. The traditional approach to providing scale against this backing store has been to “make everything stateless”, deferring the responsibility of managing state to the scale-up SQL server.

However, the incredible scalability of SQL Server carries with it a distinct lack of truly elastic scale. Instead of highly responsive resource availability, each scale-up expansion means buying a bigger server, with an expensive migration phase and capacity that greatly exceeds demand. In addition, there is an exponential cost curve when scaling past mid-range hardware.

With the architectural underpinning of Azure Cloud Services being scale-out, applications need to be designed to work with scale-out data stores. This means design challenges such as explicitly partitioning data into smaller chunks (each of which can fit in a data partition or scale-out unit), and managing consistency between distributed data elements. This achieves scale through partitioning in a way that avoids many of the drawbacks of designing to scale up.
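As a minimal illustration of the partitioning idea (the shard names and shard count here are hypothetical, not an Azure API), a stable hash of a partition key can map each data element to one of a fixed set of scale-out database shards:

```python
import hashlib

# Hypothetical shard map: each entry would name one scale-out database.
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(partition_key: str) -> str:
    """Map a partition key (e.g. a customer id) to a shard.

    A stable hash (not Python's process-randomized hash()) keeps the
    mapping consistent across processes and restarts.
    """
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[index]

# The same key always routes to the same shard.
assert shard_for("customer-42") == shard_for("customer-42")
```

Note that a real shard map would also need to handle resharding (changing the number of shards) without remapping every key; this sketch only shows the routing step.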

Pivoting from well-known scale-up techniques and toward scale-out data and state management is typically one of the biggest hurdles in designing for the cloud; addressing these challenges and designing applications that can take advantage of the scalable, elastic capabilities of Azure Cloud Services and Azure SQL Database for managing durable data is the focus of much of this document.

In a world where you run your own data center you have a nearly infinite degree of control, juxtaposed with a nearly infinite number of choices. Everything from the physical plant (air conditioning, electrical supply, floor space) to the infrastructure (racks, servers, storage, networking, and so on) to the configuration (routing topology, operating system installation) is under your control.

This degree of control comes with a cost – capital, operational, human and time. The cost of managing all of the details in an agile and changing environment is at the heart of the drive towards virtualization, and a key aspect of the march to the cloud. In return for giving up a measure of control, these platforms reduce the cost of deployment, management and increase agility. The constraint that they impose is that the size (capacity, throughput, and so on) of the available components and services is restricted to a fixed set of offerings.

To use an analogy, commercial bulk shipping is primarily based around shipping containers. These containers may be carried by various transports (ships, trains and trucks) and are available in a variety of standard sizes (up to 53 feet long). If the amount of cargo you wish to ship exceeds the capacity of the largest trailer, you need to either:

  • Use multiple trailers. This involves splitting (or partitioning) the cargo to fit into different containers, and coordinating the delivery of the trailers.

  • Use a special shipping method. For cargo which cannot be distributed into standard containers (too large, bulky, and so on), highly specialized methods, such as barges, need to be used. These methods are typically far more expensive than standard cargo shipping.

Bringing this analogy back to Azure (and cloud computing in general), every resource has a limit. Be it an individual role instance, a storage account, a cloud service, or even a data center – every available resource in Azure has some finite limit. These may be very large limits, such as the amount of storage available in a data center (similar to how the largest cargo ships can carry over 10,000 containers), but they are finite.

With this in mind, the approach to scale is to partition the load and compose it across multiple scale units – be they multiple VMs, databases, storage accounts, cloud services, or data centers.

In this document we use the term scale unit to refer to a group of resources that can (a) handle a defined degree of load, and (b) be composed together to handle additional load. For example, an Azure Storage account has a maximum size of 100 TB. If you need to store more than 100 TB of data, you will need to use multiple storage accounts (that is, at least two scale units of storage).
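Composing for scale is then a matter of arithmetic; for example, the number of storage-account scale units needed for a given capacity follows directly from the per-account limit cited above (a simple sketch, not an Azure API):

```python
import math

ACCOUNT_CAPACITY_TB = 100  # per-storage-account limit cited above

def storage_accounts_needed(total_tb: float) -> int:
    """Number of storage-account scale units required for total_tb of data."""
    return max(1, math.ceil(total_tb / ACCOUNT_CAPACITY_TB))

# 250 TB of data needs three 100 TB scale units.
assert storage_accounts_needed(250) == 3
```

The same calculation applies to any scale unit with a known capacity – databases, cloud services, and so on – although in practice you should also leave headroom below each hard limit.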

The general design guidelines for the capacity of each core Azure service or component are discussed in later sections, along with recommended approaches for composing these services for additional scale.

Enormous amounts of time, energy and intellectual capital have been invested into building highly resilient applications on-premise. This typically boils down into separating the application into low-state components (application servers, networking) and high-state components (databases, SANs), and making each resilient against failure modes.

In this context, a failure mode refers to the combination of (a) an observed failure state and (b) its underlying failure cause. For example, a database being inaccessible due to a misconfigured password update is a failure mode; the failure state is the inability to connect (connection refused, credentials not accepted), and the failure cause is a password update which was not properly communicated to application code.

Low state components deliver resiliency through loosely coupled redundancy, with their “integration” into the system managed by external systems. For example, placing additional web servers behind a load balancer; each web server is identical to the others (making adding new capacity a matter of cloning a base web server image), with integration into the overall application managed by the load balancer.

High state components deliver resiliency through tightly coupled redundancy, with the “integration” tightly managed between components. Examples are:

  • SQL Server. Adding a redundant SQL Server instance as part of an active/passive cluster requires careful selection of compatible (that is, identical!) hardware, and shared storage (such as a SAN) to deliver transactional consistent failover between multiple nodes.

  • Electrical supply. Providing redundant electrical supply is a very sophisticated example, requiring multiple systems acting in concert to mitigate both local issues (multiple power supplies for a server, with onboard hardware to switch between primary and secondary supplies) and datacenter-wide issues (backup generators for loss of power).

Resiliency solutions based around tightly coupled approaches are inherently more expensive than loosely coupled “add more cloned stuff” approaches, requiring highly trained personnel, specialized hardware, and careful configuration and testing. Not only is it hard to get right, but it costs money to do it correctly.

This approach of focusing on ensuring that the hardware platforms are highly resilient can be thought of as a “titanium eggshell”. To protect the contents of the egg, we coat the shell in a layer of tough (and expensive) titanium.

Experience running systems at scale has demonstrated that in any sufficiently large system (such as data systems at the scale of Azure) the sheer number of physical moving parts leads to some pieces of the system always being broken. The Azure platform was designed in such a way to work with this limitation, rather than against it, relying on automated recovery from node-level failure events. This design intent flows through all core Azure services, and is key to designing an application that works with the Azure availability model.

Shifting to Azure changes the resiliency conversation from an infrastructure redundancy challenge to a services redundancy challenge. Many of the core services which dominate availability planning on premise “just keep working” in Azure:

  • SQL Database automatically maintains multiple transactionally consistent replicas of your data. Node level failures for a database automatically fail over to the consistent secondary; contrast the ease of this experience with the time and expense required to provide on-premise resiliency.

  • Azure Storage automatically maintains multiple consistent copies of your data. Node-level failures for a storage volume automatically fail over to a consistent secondary. As with SQL Database, contrast this completely managed experience with the direct management of resilient storage in an on-premise cluster or SAN.

However, the title of this section is availability, not resiliency. Resiliency is only one part of the overall story of continuously delivering value to users within the bounds of an SLA. If all of the infrastructure components of a service are healthy, but the service cannot cope with the expected volume of users, it is not available, nor is it delivering value.

Mobile- or social-centric workloads (such as public web applications with mobile device applications) tend to be far more dynamic than those that target captive audiences, and they require a more sophisticated approach to handling burst events and peak load. The key concepts to keep in mind for designing availability into Azure applications are discussed in detail throughout this document, based on these pivots:

  • Each service or component in Azure provides a certain Service Level Agreement (SLA); this SLA may not be directly correlated with the availability metric required to run your application. Understanding all of the components in your system, their availability SLA, and how they interact is critical in understanding the overall availability that can be delivered to your users.

    • Avoid single points of failure that will degrade your SLA, such as single-instance roles.

    • Compose or fall back to multiple components to mitigate the impact of a specific service being offline or unavailable.

  • Each service or component in Azure can experience failure, either a short-lived transient, or a long-lived event. Your application should be written in such a way as to handle failure gracefully.

    • For transient failures, provide appropriate retry mechanisms to reconnect or resubmit work.

    • For other failure events, provide rich instrumentation on the failure event (reporting the error to operations), and a suitable error message back to the user.

    • Where possible, fall back to a different service or workflow. For example, if a request to insert data into SQL Database fails (for any non-transient reason, such as invalid schema), write the data into blob storage in a serialized format. This would allow the data to be durably captured, and submitted to the database after the schema issue has been resolved.

  • All services will have a peak capacity, either explicitly (through a throttling policy or peak load asymptote) or implicitly (by hitting a system resource limit).

    • Design your application to degrade gracefully when it hits resource limits, taking appropriate action to soften the impact on the user.

    • Implement appropriate back-off/retry logic to avoid a “convoy” effect against services. Without an appropriate back-off mechanism, downstream services will never have a chance to catch up after experiencing a peak event (as your application will be continuously trying to push more load into the service, triggering the throttling policy or resource starvation).

  • Services which can experience rapid burst events need to gracefully handle exceeding their peak design load, typically through shedding functionality.

    • Similar to how the human body restricts blood flow to the extremities when dealing with extreme cold, design your services to shed less-important services during extreme load events.

    • The corollary here is that not all services provided by your application have equivalent business criticality, and can be subject to differential SLAs.
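The retry and back-off guidance above can be sketched as follows. This is a generic illustration, not an Azure SDK API; the `TransientError` type and the operation being retried are placeholders:

```python
import random
import time

class TransientError(Exception):
    """Placeholder for a short-lived failure (e.g. throttling, timeout)."""

def with_backoff(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Run operation, retrying transient failures with exponential back-off.

    Randomized jitter spreads retries out so that many clients do not
    retry in lockstep, avoiding the "convoy" effect described above.
    Non-transient exceptions propagate immediately to the caller, which
    should instrument the failure and inform the user.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # give up: surface the failure to instrumentation
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            sleep(delay)
```

For example, `with_backoff(lambda: submit_order(order))` would resubmit a throttled request a few times before surfacing the error, with each wait roughly doubling.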

These high level concepts will be applied in more detail in each of the sections describing the core Azure services, along with availability goals for each service or component, and recommendations on how to design for availability. Keep in mind that the datacenter is still a single-point-of-failure for large applications; from electrical supplies (example here) to systems error (example here), infrastructure and application issues have brought down data centers. While relatively rare, applications requiring the highest levels of uptime should be deployed in multiple redundant data centers.

Deploying applications in multiple data centers requires a number of infrastructure and application capabilities:

  • Application logic to route users of the services to the appropriate data center (based on geography, user partitioning, or other affinity logic).

  • Synchronization and replication of application state between data centers, with appropriate latency and consistency levels.

  • Autonomous deployment of applications, such that dependencies between data centers are minimized (that is, avoid the situation wherein a failure in data center A triggers a failure in data center B).

As with availability, providing disaster recovery solutions (in case of data center loss) has traditionally required enormous time, energy and capital. This section will focus on approaches and considerations for providing business continuity in the face of system failure and data loss (whether system or user triggered), as the term “disaster recovery” has taken on specific connotations around implementation approaches in the database community.

Delivering business continuity breaks down to:

  • Maintaining access to and availability of critical business systems (applications operating against durable state) in the case of catastrophic infrastructure failure.

  • Maintaining access to and availability of critical business data (durable state) in the case of catastrophic infrastructure failure.

  • Maintaining availability of critical business data (durable state) in the case of operator error or accidental deletion, modification or corruption.

The first two elements have traditionally been addressed in the context of geographic disaster recovery (geo-DR), with the third being the domain of data backup and data restoration.

Azure changes the equation significantly for availability of critical business systems, enabling rapid geographically distributed deployment of key applications into data centers around the globe. Indeed, the process of rolling out a geographically distributed application is little different than rolling out a single cloud service.

The key challenge remains managing access to durable state; accessing durable state services (such as Azure Storage and SQL Database) across data centers typically produces sub-optimal results due to high and/or variable latency, and does not fulfill the business continuity requirement in case of data center failure.

As with resiliency, many Azure services provide (or have in their roadmap) automatic geo replication. For example, unless specifically configured otherwise, all writes into Azure storage (blob, queue, or table) are automatically replicated to another data center (each data center has a specific “mirror” destination within the same geographic region). This greatly reduces the time and effort required to provide traditional disaster recovery solutions on top of Azure. An overview of the geo replication capabilities of the core Azure services managing durable state is provided in later sections.

For maintaining business continuity in the face of user or operator error, there are several additional considerations to account for in your application design. While Azure Storage provides limited audit capability through the Storage Analytics feature (described in a later section), it does not provide any point-in-time restore capabilities. Services requiring resiliency in the face of accidental deletion or modification will need to look at application-centric approaches, such as periodically copying blobs to a different storage account.
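The periodic-copy pattern mentioned above can be sketched as follows. The `list_blobs` and `copy_blob` callables are hypothetical stand-ins for whatever storage client is in use; only the application-centric backup pattern itself is the point:

```python
import datetime

def backup_blobs(source, destination, list_blobs, copy_blob):
    """Copy every blob in `source` to `destination` under a dated prefix.

    `list_blobs(account)` yields blob names; `copy_blob(src_account, name,
    dst_account, target_name)` performs one copy. Both are hypothetical
    callables, not a real storage API. Keeping one dated prefix per run
    gives a crude point-in-time restore capability: to recover from an
    accidental deletion, copy back from the most recent dated prefix.
    """
    stamp = datetime.date.today().isoformat()
    copied = []
    for name in list_blobs(source):
        target = f"{stamp}/{name}"
        copy_blob(source, name, destination, target)
        copied.append(target)
    return copied
```

A production version would also need to prune old backup prefixes and cope with blobs changing mid-copy, neither of which is shown here.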

SQL Database provides basic functionality for maintaining historical snapshots of data, including DB Copy and import/export via bacpac. These options are discussed in detail later in this document.

With the elastic scale provided by the Azure platform, the supply curve can closely match the demand curve (rather than having a large amount of extra capacity to account for peak load).

With elastic scale, cost of goods is driven by:

  • How many scale units are employed in a solution – VMs, storage accounts, and so on (composing for scale).

  • How efficiently work is performed by those scale units.

We refer to how much work can be performed for a given amount of capacity as the density of the application. Denser services and frameworks allow a greater amount of work to be performed for a given resource deployment; that is to say improving density enables reduction in deployed capacity (and cost), or the ability to absorb additional load with the same deployed capacity. Density is driven by two key factors:

  • How efficiently work is performed within a scale unit. This is the traditional form of performance optimization – managing thread contention and locks, optimizing algorithms, tuning SQL queries.

  • How efficiently work is coordinated across scale units. In a world where systems are made up of large numbers of smaller units, the ability to efficiently stitch them together is critical in delivering efficiency. This involves frameworks and tools that communicate across components, such as SOAP messaging stacks (such as WCF), ORMs (such as Entity Framework), TDS calls (SQL client code), and object serialization (such as Data Contracts or JSON).

In addition to the traditional optimization techniques used against a single computer (or database), optimizing the distributed communication and operations is critical in delivering a scalable, efficient Azure service. These key optimizations are covered in detail in later sections:

  • Chunky not chatty. For every distributed operation (that is, one resulting in a network call) there is a certain amount of overhead for packet framing, serialization, processing, and so on. To minimize the amount of overhead, try to batch into a smaller number of “chunky” operations rather than a large number of “chatty” operations. Keep in mind that batching granular operations does increase latency and exposure to potential data loss. Examples of proper batching behavior are:

    • SQL. Execute multiple operations in a single batch.

    • REST and SOAP services (such as WCF). Leverage message-centric operation interfaces, rather than a chatty RPC style, and consider a REST-based approach if possible.

    • Azure storage (blobs, tables, queues). Publish multiple updates in a batch, rather than individually.

  • Impact of serialization. Moving data between machines (as well as in and out of durable storage) generally requires the data be serialized into a wire format. The efficiency (that is, the time taken and space consumed) of this operation quickly dominates overall application performance for larger systems.

    • Leverage highly efficient serialization frameworks.

    • Use JSON for communication with devices, or for interoperable (human readable) applications.

    • Use very efficient binary serialization (such as protobuf or Avro) for service-to-service communication when you control both endpoints.

  • Use efficient frameworks. There are many rich frameworks available for development, with large, sophisticated feature sets. The downside to many of these frameworks is that you often pay the performance cost for features you don’t use.

    • Isolate services and client APIs behind generic interfaces to allow for replacement or side-by-side evaluation (either through static factories, or an inversion of control container). For example, provide a pluggable caching layer by working against a generic interface rather than a specific implementation (such as Azure caching).
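The “chunky not chatty” guidance above can be illustrated with a simple write buffer that submits work in batches rather than one item per call. This is a generic sketch; `submit_batch` stands in for any batch-capable service call (a SQL batch, a storage batch operation, and so on):

```python
class BatchingWriter:
    """Accumulate items and flush them in chunks of `batch_size`.

    One network round-trip per batch replaces one per item, trading a
    little latency (and exposure to loss of not-yet-flushed items) for
    far less per-call framing and serialization overhead.
    """

    def __init__(self, submit_batch, batch_size=100):
        self._submit = submit_batch  # callable taking a list of items
        self._size = batch_size
        self._pending = []

    def write(self, item):
        self._pending.append(item)
        if len(self._pending) >= self._size:
            self.flush()

    def flush(self):
        """Submit any buffered items; call on shutdown to avoid loss."""
        if self._pending:
            self._submit(list(self._pending))
            self._pending.clear()
```

A real implementation would usually also flush on a timer, so that a trickle of writes does not sit unflushed indefinitely.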

In the previous section we introduced the key design concepts and perspectives for building applications that take advantage of the cloud fabric provided by Azure. This section will explore the core platform services and features, illustrating their capabilities, scale boundaries and availability patterns.

As every Azure service or infrastructure component provides a finite capacity with an availability SLA, understanding these limits and behaviors is critical in making appropriate design choices for your target scalability goals and end-user SLA. Each of the core Azure services is presented in the context of four key pivots: features and their intent; density; scalability; and availability.

An Azure subscription is the basic unit of administration, billing, and service quotas. Each Azure subscription has a default set of quotas that can be increased by contacting support, and are intended to prevent accidental overages and resource consumption.

Each subscription has an account owner, and a set of co-admins, authorized through Microsoft Accounts (formerly Live IDs), who have full control over the resources in the subscription through the management portal. They can create storage accounts, deploy cloud services, change configurations, and can add or remove co-admins.

The Azure Management APIs (REST-based web services) provide an automation interface for creating, configuring and deploying Azure services (used by the management portal under the hood). Access to these APIs is restricted using management certificates.


Each subscription carries a default limit on, for example, the number of cloud services, storage accounts, and SQL Database logical servers it may contain.
For the latest Azure subscription and service limits, see Azure Subscription and Service Limits, Quotas, and Constraints.

An Azure cloud service (formerly hosted service) is the basic unit of deployment and scale. Each cloud service consists of two deployments (production and staging), each with a set of roles. The cloud service has a public DNS entry for the production deployment (in the form servicename.cloudapp.net) and a separate DNS entry for the staging deployment.

Each deployment contains one or more roles, either a Web role or a Worker role, which in turn contains one or more instances (non-durable VMs). Each instance contains an identical, immutable, non-durable snapshot of a software package for that role (that is, all of the instances in a given role have an identical build deployed to them). These instances run an Azure-specialized version of Windows Server (with many services disabled by default for additional security, configured to work well with the Azure networking and service fabric, and so on), and by default are automatically patched by the Azure fabric. Live patching is handled through a rolling upgrade scheme, which is described below.

Cloud services can be deployed to any Azure data center, either directly (by choosing the destination region when creating the service) or through an affinity group. An affinity group is an indirect reference to a deployment destination that can be used to streamline deploying all components of an application to the same data center.

Web roles are preconfigured with an IIS instance, which hosts the application code. Application code hosted in worker roles executes in the preconfigured long-running application host. Each cloud service may have up to 25 roles. The default role configuration executes .NET code, but a role can be configured to run any code compatible with Windows Server – Java, Python, Ruby, node.js, and so on. All of the platform features referred to in this document are available from any platform, but may require additional client-proxy development to target the REST-based APIs.

Within a cloud service, all instances are assigned private IP addresses (in the 10.x block); all outbound connections appear to come from a single virtual IP address, or VIP (which is the VIP of the cloud service deployment), through Network Address Translation. All inbound connections must pass through configured endpoints; these endpoints provide load-balanced access to internal roles and ports. For example, by default inbound HTTP/HTTPS (port 80 and 443) connections to the cloud service deployment are load balanced against all of the available instances of the primary web role.

Note that cross-service latency (that is, traversing the NAT out of one cloud service and through the load balancer into another) is far more variable than the on-premise equivalent, and is one of the reasons batched or “chunky” cross-service connections are encouraged for scalability.

The Azure fabric also provides a configuration service available to all instances in the cloud service deployment. A static set of expected configuration keys is provided in the service definition (as part of the development cycle), with the initial set of configuration values deployed along with the service when it is published to Azure. This set of configuration values is available as a run-time lookup to all instances in the service deployment, and may be modified at run-time through a REST interface, the Azure portal, or a PowerShell script.

When the run-time configuration is changed, all instances may choose (in application code) to hook the configuration change notification and handle the configuration updates internally. If application code is not written to capture the configuration change event, all instances in the role will experience a rolling reboot to update their configuration.

The state of each instance is not durable; any configuration beyond the base Azure image (a specialized Windows Server VM) must be applied at startup: creating performance counters, tuning IIS settings, installing dependent software, and so on. These configuration scripts are typically run as a startup task defined by the cloud service configuration.

Inside of a cloud service, information about the configuration, internal IP addresses, network topology, and so on, is provided through the RoleEnvironment class. All processes running on an Azure instance can access the RoleEnvironment information to retrieve configuration, discover the network topology, and so on. You can also use the Azure management APIs to access this information externally.

The Azure fabric exposes two core concepts for managing component failure, reconfiguration and upgrade/patching: upgrade domains and fault domains.

Upgrade domains are logical groupings within an Azure service; by default, each service has five (5) upgrade domains (this value may be modified in the cloud service definition). Any service change or upgrade affects only a single upgrade domain at a time. Examples of these changes include patching the OS, changing the virtual machine size, adding roles or role instances to a running service, or modifying the endpoint configuration.

This allows live reconfiguration of a running cloud service while maintaining availability. For roles containing only a single instance, the Azure fabric cannot provide availability during upgrade operations, which is why running single-instance roles does not meet the Azure SLA.

Fault domains are logical groupings based on the underlying hardware. While not guaranteed to map directly to any particular hardware configuration, think of the logical grouping as the Azure fabric’s way of automatically separating instances from underlying resources that represent a single point of failure (such as a single underlying physical server, rack, and so on). To meet the service SLA, Azure needs to deploy instances to at least two fault domains. This is the other reason single-instance role deployments do not meet the Azure SLA.

In summary:

  • The basic unit of deployment and scale in Azure is the Cloud Service, consisting of a set of roles. Each role contains a set of identical role instances, each running a specialized, cloud-configured version of Windows Server.

  • In addition to the physical topology (roles and instances) and application code, cloud services define a service-wide configuration. This configuration may be updated at run-time.

  • Each role instance is non-durable (changes, files, and so on, are not guaranteed to be persisted in the event of reboot, patching, failure events).

  • Each cloud service exposes a single virtual IP for both inbound and outbound traffic. The cloud service exposes endpoints, which provide (round robin) load-balanced mapping to an internal role and port.

  • Azure uses upgrade domains to logically separate groups of instances, providing rolling upgrades or modifications while maintaining availability.

  • Azure uses fault domains to physically group instances away from single points of failure (such as running all instances on the same underlying physical machine).

  • Leverage multiple subscriptions to isolate development, test, staging and production environments.

Each role contains a set of one or more instances. Each of these instances is a VM running a specialized version of Windows Server. Instances (VMs) are currently available in five sizes, extra-small through extra-large. Each of these sizes is allocated a certain amount of CPU, memory, storage, and bandwidth.


Virtual Machine Size | CPU Cores | Memory | Disk Space for Local Storage Resources in Web and Worker Roles | Disk Space for Local Storage Resources in a VM Role | Allocated Bandwidth (Mbps)
Extra Small | Shared | 768 MB | 19,480 MB (6,144 MB is reserved for system files) | 20 GB | 5
Small | 1 | 1.75 GB | 229,400 MB (6,144 MB is reserved for system files) | 165 GB | 100
Medium | 2 | 3.5 GB | 500,760 MB (6,144 MB is reserved for system files) | 340 GB | 200
Large | 4 | 7 GB | 1,023,000 MB (6,144 MB is reserved for system files) | 850 GB | 400
Extra Large | 8 | 14 GB | 2,087,960 MB (6,144 MB is reserved for system files) | 1,890 GB | 800


For the latest Azure subscription and service limits, see Azure Subscription and Service Limits, Quotas, and Constraints.

Provided that two or more instances are deployed in different fault and upgrade domains, Azure provides the following SLAs for cloud services:

  • 99.95% external connectivity for Internet-facing roles (those with external endpoints)

  • 99.9% of role instance issues detected within two minutes, with corrective action started

Both the role instance sizes and counts may be dynamically changed in a running application (note: changing role instance sizes will trigger a rolling redeployment). Given the scale-out approach for building Azure applications, bigger is not necessarily better when it comes to selecting instance sizes. This applies both to cost (why pay for what you don’t use) and performance (depending on whether your workload is CPU bound, I/O bound, and so on). Approaches for selecting the number of instances and instance sizes are reviewed in more detail in the Best Practices section of this document.

Azure Storage is the baseline durable data service for Azure, providing blob (file), queue, and table (key-value) storage. The storage account is the basic unit of scale and availability. All communication with the storage service is based on a REST interface over HTTP.

For the latest Azure subscription and service limits, see Azure Subscription and Service Limits, Quotas, and Constraints.

The Azure Storage availability SLA guarantees that at least 99.9% of the time, correctly formatted requests to add, update, read and delete data will be successfully and correctly processed, and that storage accounts will have connectivity to the Internet gateway.

These limits are shared between all usage of the individual storage account; i.e. the number of operations per second and overall bandwidth are shared between tables, blobs and queues. If an application exceeds the total number of operations per second, the service may return an HTTP 503 (Server Busy) response. Operations are specific to each storage aspect (tables, queues or blobs), and are described in the sub-sections below.

In terms of the shipping container metaphor used earlier, each storage account is a container that provides a certain amount of capacity. Exceeding the limit of a single account (shipping container) requires leveraging multiple accounts in the same application.

Azure Storage provides availability and resiliency by default; all writes or updates to Azure Storage are transparently and consistently replicated across three storage nodes (which live in different upgrade and fault domains). Access to Azure Storage is controlled using single-factor authentication in the form of access keys. Each storage account has a primary and a secondary key, allowing for continuous availability when the primary key is rotated. Data in Azure Storage is automatically geo-replicated to a “mirror” data center (unless this feature is specifically disabled using the portal). Geo-replication is opaque, leveraging DNS redirection to fail clients over to the secondary location in case of failure of the primary data center.

Note that while Azure Storage provides data resiliency through automated replicas, this does not prevent your application code (or developers/users) from corrupting data through accidental or unintended deletion, update, and so on. Maintaining data fidelity in the face of application or user error requires more advanced techniques, such as copying the data to a secondary storage location with an audit log. Blob storage provides a snapshot capability that can create read-only point-in-time snapshots of blob contents, which can be used as the basis of a data fidelity solution for blobs.

Azure Storage provides telemetry through its Storage Analytics feature, collecting and exposing usage data about individual storage calls to tables, queues and blobs. Storage Analytics needs to be enabled for each storage account with a collection policy (collect for all, only for tables, and so on) and a retention policy (how long to keep data).

Blob storage provides the file management service in Azure, delivering a highly available, cost efficient method for storing bulk unstructured data. The service provides two types of blobs:

  • Block blobs. Block blobs are designed for managing large blobs of data efficiently. Each block blob consists of up to 50,000 blocks, each of a maximum size of 4 MB (with an overall maximum block blob size of 200 GB). Block blobs support parallel upload for efficiently and concurrently moving large files over networks. Individual blocks may be inserted, replaced or deleted – but cannot be edited in place.

  • Page blobs. Page blobs are designed for efficiently providing random read/write operations (such as accessing a VHD). Each page blob has a maximum size of 1 TB, consisting of 512-byte pages. Individual or groups of pages may be added or updated, with an in-place overwrite.
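The block upload mechanics described above can be sketched as follows. This is an illustrative sketch, not the storage client library: `upload_block` and `commit_block_list` are hypothetical stand-ins for the REST operations that upload a block and commit the ordered block list.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB maximum block size

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Split a payload into blocks no larger than the 4 MB block limit."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def upload_blob_in_parallel(data: bytes, upload_block, commit_block_list, workers: int = 4):
    """Upload all blocks concurrently, then commit the ordered block list.

    `upload_block(block_id, block)` and `commit_block_list(block_ids)` are
    hypothetical callbacks standing in for the storage client's operations.
    """
    blocks = split_into_blocks(data)
    block_ids = ["block-%06d" % i for i in range(len(blocks))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Blocks may land in any order; the committed list defines the blob.
        list(pool.map(lambda pair: upload_block(*pair), zip(block_ids, blocks)))
    commit_block_list(block_ids)
```

Because individual blocks are independent, failed block uploads can be retried without restarting the whole transfer; only the final commit makes the blob visible.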

The design limits for blob storage are listed in the table below. Remember that all of these operations count against the overall storage account limits.


Blob Category | Limit
Max blob size (block) | 200 GB (50,000 blocks)
Max block size | 4 MB
Max blob size (page) | 1 TB
Page size | 512 bytes
Max bandwidth / blob | 480 Mbps

Upon exceeding the size or bandwidth limits of a single blob, applications can write to multiple concurrent (or sequential) blob files. If your application exceeds the limits of a single storage account, leverage multiple storage accounts for additional capacity.

Azure queues provide an intermediary (brokered) messaging service between publishers and subscribers. Queues support multiple concurrent publishers and subscribers, but do not natively expose higher order messaging primitives such as pub/sub or topic-based routing. They are typically used to distribute work items (such as messages, documents, tasks, and so on) to a set of worker role instances (or between multiple hosted services, and so on).

[Figure: Windows Azure Queues]

Queue messages are automatically deleted after 7 days if not retrieved and deleted by an application. They provide decoupling between publishers and consumers of information; so long as both sides have the storage account key and queue name, they can communicate.


Queue Category | Limit
Maximum messages in queue | N/A (up to storage account limit)
Max lifetime of a message | 1 week (automatically purged)
Max message size | 64 kB
Max throughput | ~500 messages / second

Queues are intended to pass control messages, not raw data. If your messages are too large to fit in a queue message, refactor your messages to separate data and command. Store the data in blob storage, with a reference (URI) to the data and the intent (that is, what to do with the data in blob storage) stored in a queue message.
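This data/command separation can be sketched as a small message envelope. The sketch below is illustrative only: the URI shown is a hypothetical example, and the 64 kB guard mirrors the queue message limit noted above.

```python
import json
from typing import Optional

MAX_QUEUE_MESSAGE_BYTES = 64 * 1024  # queue message size limit

def make_command_message(blob_uri: str, intent: str, metadata: Optional[dict] = None) -> str:
    """Build a small queue message referencing data stored in blob storage.

    The message carries only the intent and a URI; the payload itself lives
    in the blob, keeping the message well under the 64 kB queue limit.
    """
    message = {"intent": intent, "data_uri": blob_uri, "metadata": metadata or {}}
    body = json.dumps(message)
    if len(body.encode("utf-8")) > MAX_QUEUE_MESSAGE_BYTES:
        raise ValueError("queue message exceeds 64 kB; move more state into the blob")
    return body

def parse_command_message(body: str) -> dict:
    """Consumer side: recover the intent and the blob reference."""
    return json.loads(body)
```

A worker receiving such a message fetches the blob at `data_uri`, performs the action named by `intent`, and only then deletes the queue message.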

To increase throughput within a single queue, batch multiple messages in a single message, then leverage the Update Message command to track progress of the encapsulating message’s tasks. Another technique is to place multiple messages in a blob with a pointer to the blob in the queue message.
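The batching technique above can be sketched as a simple packing function: accumulate work items into a batch until adding one more would push the serialized message over the 64 kB limit, then start a new batch. This is an illustrative sketch (it assumes each individual item fits within the limit on its own).

```python
import json

MAX_MESSAGE_BYTES = 64 * 1024  # queue message size limit

def batch_messages(items: list) -> list:
    """Pack many small work items into as few queue messages as possible,
    keeping each serialized batch under the 64 kB message limit.

    Assumes each individual item serializes to less than the limit.
    """
    batches, current = [], []
    for item in items:
        candidate = current + [item]
        if current and len(json.dumps(candidate).encode("utf-8")) > MAX_MESSAGE_BYTES:
            batches.append(json.dumps(current))  # close the full batch
            current = [item]                     # start a new one
        else:
            current = candidate
    if current:
        batches.append(json.dumps(current))
    return batches
```

The consumer then processes one batch per dequeue, optionally using the queue's Update Message command to checkpoint progress through the batch.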

If your application requires more throughput than provided by a single queue, leverage multiple concurrent queues. In this context, your application must implement appropriate partitioning and routing logic.

Azure table storage provides a highly durable, scalable, consistent store for columnar (two-dimensional) data. It delivers a { partition key, row key } -> { data[] } semantic for storing and accessing data, as seen in the diagram below. Each table is broken down into partitions, which in turn contain entities. Each entity can have its own (flat) schema, or list of properties (columns).

[Figure: Windows Azure Tables]

Each partition supports up to 500 operations per second; in turn, each table supports up to the maximum operation rate available to the storage account. Because each entity contains not only the actual data but also the columnar metadata (as every entity can have a different schema), long column names are not recommended, especially at large scale.
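Since every entity stores its column names alongside its values, the per-row cost of verbose names adds up. The rough accounting below is purely illustrative (it is not the service's exact on-the-wire size), but shows why terse property names shrink every row:

```python
def approximate_entity_size(entity: dict) -> int:
    """Rough per-entity payload estimate: each property stores its column
    name alongside its value, so long names are paid for on every row.
    Illustrative accounting only, not the service's exact wire format."""
    return sum(len(name.encode("utf-8")) + len(str(value).encode("utf-8"))
               for name, value in entity.items())

# The same data under verbose vs. terse (hypothetical) column names:
verbose = {"CustomerTransactionTimestampUtc": "2014-10-14T00:00:00Z",
           "CustomerAccountIdentifierValue": 12345}
terse = {"TsUtc": "2014-10-14T00:00:00Z", "AcctId": 12345}
```

Across millions of entities, that per-row difference translates directly into storage and bandwidth costs.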


Table Category | Limit
Max operations per second per partition | 500
Max entity size (column names + data) | 1 MB
Max column size (byte[] or string) | 64 kB
Maximum number of rows | N/A (up to storage account limit)
Supported data types | byte[], Boolean, datetime, double, Guid, int32, int64, string

Individual entities (which you can think of as rows) have a maximum size of 1 MB, with individual columns being limited to a max of 64 kB. Supported data types are listed in the table above; for non-supported types (such as DateTimeOffset) a serialization proxy is required in your application code (for example, storing the DateTimeOffset in a standard string format).
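The serialization-proxy technique can be sketched generically: round-trip an unsupported type through a supported string representation. Below, a timezone-aware datetime (standing in for DateTimeOffset) is stored as a sortable ISO 8601 string; the sketch is illustrative, not tied to any particular client library.

```python
from datetime import datetime, timezone, timedelta

def to_storage_string(value: datetime) -> str:
    """Serialize a timezone-aware datetime (a stand-in for an unsupported
    type such as DateTimeOffset) into a supported string column."""
    if value.tzinfo is None:
        raise ValueError("value must carry an explicit offset")
    return value.isoformat()

def from_storage_string(text: str) -> datetime:
    """Rehydrate the proxied value on read, preserving the offset."""
    return datetime.fromisoformat(text)
```

The key property of a good proxy format is losslessness: round-tripping through the string must preserve both the instant and the offset.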

Table storage provides access to stored data using keys associated with partitions and entities, partition scanning, or entity scanning. It supports filter pushdown: a filter expression can be included as part of the query and executed within table storage. Table storage does not provide secondary indexes, so any lookup not based on the partition key or row key requires a table and/or a partition scan. For partitions containing a non-trivial number of entities, this typically has a dramatic impact on performance.

Any query processing that takes longer than 5 seconds returns a continuation token that the application can use to continue receiving the results of the query. Queries retrieving more than 1,000 entities must leverage a paging model to bring data back in chunks of 1,000 entities (which is natively supported by the table storage API).
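The continuation-token loop can be sketched as a generator that drains a paged query. `execute_query` here is a hypothetical stand-in for the table storage query call; it returns a page of entities plus a token, or None when the result set is exhausted.

```python
def query_all_entities(execute_query, page_size: int = 1000):
    """Drain a paged query that returns (entities, continuation_token) pairs.

    `execute_query(token, page_size)` is a hypothetical callback standing in
    for a table storage query; it returns at most `page_size` entities and
    the token to pass on the next call (None when done).
    """
    token = None
    while True:
        entities, token = execute_query(token, page_size)
        for entity in entities:
            yield entity
        if token is None:
            break
```

Because this is a generator, callers can stop early without fetching every page, which matters when a scan might touch a very large partition.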

The only query expressions currently supported against table storage are filtering and selection (selecting specific properties); table storage does not provide a server-side aggregation or grouping semantic. To build applications requiring rich aggregation or analytic capabilities, it is often a better choice to either store data in aggregated form or use a relational engine, such as Azure SQL Database. Some applications employ a hybrid approach, aggregating data from table storage to an ancillary SQL Database which is then used for query and reporting purposes.

Choosing an appropriate partitioning function is critical in efficient and effective use of table storage. There are two primary choices for the type of partitioning function:

  • Time. Commonly used for storing time-series data, such as Azure Diagnostics performance counters (a usage covered in the telemetry section in this document), time-based partitioning functions convert the current time into a value representing a window of time (the current minute, hour, and so on).

    These allow efficient lookup and location of a specific partition (as the filter clause for table storage supports >=, <=, and so on), but can be prone to throttling if the time window chosen is too narrow and a spike event occurs. For example, if the partition function chosen is the current minute, and a spike event occurs, too many clients may be attempting to write into the same partition concurrently. This not only affects the throughput on insert, but also the throughput on query.

  • Data. Data-centric partitioning functions calculate the partition value based on one or more properties of the data to be stored (or retrieved). Choosing an appropriate data-driven partitioning function depends on several factors – query patterns, density of partition keys (how many entities will end up in a partition), and unpredictable growth (it can be challenging to rebalance very large tables). Common patterns include:

    • Single-field. The partition key is a single field in the source data (such as a customer ID for order information).

    • Multi-field. Either the partition or row key is a composite (typically a concatenation) of multiple fields in the source data. When selecting partition keys, note that batch operations require all entities to be in the same partition (i.e. have the same partition key).

    • Calculated field. The partition key is calculated from one or more fields, based on a deterministic function. A common example of this would be distributing user profiles into multiple partitions. The user ID would be hashed using a hashing function designed for relatively uniform distribution, then the modulo is taken against the number of desired partitions.
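The time-based and calculated-field partitioning functions above can be sketched as follows. This is an illustrative sketch; the window names and the use of MD5 are assumptions, chosen only to show the shape of each function.

```python
import hashlib
from datetime import datetime

def time_partition_key(timestamp: datetime, window: str = "hour") -> str:
    """Time-based partition key: collapse a timestamp into a window value.
    Narrow windows (e.g. per-minute) risk hot partitions during spike events."""
    fmt = {"minute": "%Y%m%d%H%M", "hour": "%Y%m%d%H", "day": "%Y%m%d"}[window]
    return timestamp.strftime(fmt)

def hashed_partition_key(user_id: str, partition_count: int) -> str:
    """Calculated-field partition key: hash the user ID for a roughly
    uniform spread, then take the modulo against the desired partition count."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return str(int.from_bytes(digest[:4], "big") % partition_count)
```

Both functions are deterministic, so readers and writers compute the same partition key without coordination; the time-based form additionally preserves range-query ordering.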

Any non-trivial application will require use of multiple partitions. Even for tables with a small number of total entities (e.g. two hundred), if the application will make several thousand requests per second, multiple partitions will be necessary for throughput:

  • Single table / single partition. Simplest option (constant partition key value), suitable for small scale workloads with limited amounts of data and request throughput requirements (< 500 entities/sec).

  • Single table / multi-partition. The typical option for most deployments; carefully choose partition keys to align with the target query patterns.

  • Multi-storage account / multi-partition. If the load is projected to exceed 5,000 operations per second, use of tables spread across multiple storage accounts is required.

Once a partitioning function has been chosen, rebalancing (repartitioning) data can be an expensive exercise, involving reading and copying all entities with new partition keys, then deleting the old data. Note that there is no minimum size restriction on a partition; partitions can consist of a single entity (that is, a single row).

If your application requires more throughput than provided by a single table (after careful partition selection), leverage multiple concurrent tables in different storage accounts. In this context, your application will need to implement appropriate routing logic to select the appropriate storage account.

The Azure Content Delivery Network (CDN) provides an efficient way to cache static content (either from blobs or application output) in a globally distributed caching network. This alleviates pressure on your application for delivering static content, and improves the overall end-user experience.

The Content Delivery Network availability SLA guarantees that the cached objects will be delivered with 99.9% availability on a monthly basis.

Using the CDN requires that the feature be activated for your subscription. From there, blob content (from publicly available / anonymous access containers) and anonymous application output content can be cached.

In general, for any large scale application, all commonly accessed static content (images, css, and so on) should be delivered via CDN with appropriate cache expiration policies.

For example, consider an online ebook store with 1 million titles. Including the content (images, and so on) for all titles in the CDN would be very expensive (as the majority of the content would be accessed infrequently and would constantly expire from the cache), whereas including only the top content (for example, the top 50 titles) provides the right balance between caching and price.

One of the core elements of successfully delivering a large scale service is telemetry – insight into the operation, performance and end-user experience of your application. Telemetry approaches for Azure applications need to take into account both the scale-out / distributed nature of the platform, and the available platform services for collecting, analyzing and consuming telemetry.

The baseline technology component for gathering and understanding telemetry on Azure is Azure Diagnostics (WAD). WAD consists of an agent that gathers data from individual instances and forwards it to a central collection point (storage account), as well as a set of standard structures and conventions for storing and accessing data. The agent supports a number of configuration approaches, including code (.NET), a configuration file embedded within the deployed project code, or a centralized configuration file deployed to blob storage. The configuration is partially dynamic in the last instance, allowing updated diagnostic files to be pushed to blob storage, and then pulled down to the targeted agents.

[Figure: Windows Azure Diagnostics]

The WAD configuration provides for a number of data sources; each of these data sources is periodically gathered, batched through an Event Tracing for Windows (ETW) session, and published to the target storage account. The agent takes care of transient connection issues, retries, and so on. The available data sources are:

  • Performance Counters. A subset of performance counter values (CPU, memory, and so on) captured into a local ETW session, and periodically staged over into table storage.

  • Windows Event Logs. A subset of Windows Event records (Application, System, and so on) captured into a local ETW session, and periodically staged over into table storage.

  • Azure Logs. Application logs (trace messages) published by application code (via System.Diagnostics.Trace) and captured by a DiagnosticMonitorTraceListener trace listener. These are published into the WADLogsTable in the destination storage account.

  • IIS 7.0 logs. Standard IIS log information about requests logged by IIS (web roles only). Logs are collected into local files, and periodically staged over into blob storage.

  • IIS Failed request logs. Information from IIS about failed requests, collected into local files, and periodically staged into blob storage.

  • Crash Dumps. In the event of a crash, dumps capturing process state are collected and published to blob storage.

  • Data Source. WAD can monitor additional local directories (such as log directories in local storage) and periodically copy the data into a custom container in blob storage.

Each of these data sources is configured with the subset of data to gather (for example, the list of performance counters) and the collection/publishing interval. There are also a number of PowerShell scripts available to change the run-time configuration or force an immediate publish of data from the agents to the target storage account.

Log telemetry into a separate storage account. Logging telemetry and application data into the same storage account will cause serious contention issues at scale.

Azure SQL Database provides database-as-a-service, allowing applications to quickly provision, insert data into, and query relational databases. It provides many of the familiar SQL Server features and functionality, while abstracting the burden of hardware, configuration, patching and resiliency.

SQL Database does not provide 1:1 feature parity with SQL Server, and is intended to fulfill a different set of requirements uniquely suited to cloud applications (elastic scale, database-as-a-service to reduce maintenance costs, and so on). For more information, see

The service runs in a multi-tenant shared environment, with databases from multiple users and subscriptions running on infrastructure built on commodity hardware (scale-out not scale-up).

Databases are provisioned inside of logical servers; each logical server contains by default up to 150 databases (including the master database). By default, each subscription can provision five (5) logical servers, with the ability to increase this quota, as well as the maximum number of databases per logical server, by calling support.

Each logical server is assigned a public, unique, generated DNS name (of the form [servername]), with every logical server in a subscription sharing the same public IP address. Logical servers (and databases) are accessed through the standard SQL port (TCP/1433), with a REST-based management API accessed on port TCP/443.

By default, access to the logical server (and all of its databases) is restricted via IP-based firewall rules; rules can be set against the logical server or against individual databases. Enabling access for Azure applications, and direct connectivity from outside of Azure (for example, connecting with SQL Server Management Studio), requires configuring firewall rules. These rules may be configured through the Azure management portal or by using a management service API call.

SQL Database provides support for most of the key features present in SQL Server, with several important exceptions including:

  • All tables must contain a CLUSTERED INDEX; data cannot be INSERT’ed into a table in SQL Database until a CLUSTERED INDEX has been defined.

  • No embedded Common Language Runtime (CLR) support, database mirroring, service broker, data compression, or table partitioning.

  • No XML indexes; XML data type is supported.

  • No support for transparent data encryption (TDE) or data auditing.

  • No full-text search support.

Each database, when created, is configured with an upper size limit. The currently available size caps are 1 GB, 5 GB, 10 GB, 20 GB, 30 GB, 40 GB, 50 GB, 100 GB and 150 GB (the currently available maximum size). When a database hits its upper size limit, it rejects additional INSERT or UPDATE commands (querying and deleting data is still possible). The size limit of a database may also be changed (increased or decreased) by issuing an ALTER DATABASE command.

As databases are billed based on the average size used per day, applications expecting rapid or unpredictable growth may elect to initially set the database maximum size to 150 GB. Scaling a database beyond 150 GB requires leveraging a scale-out approach, described in more detail in the section below.

SQL Database provides built-in resiliency to node-level failure. All writes into a database are automatically replicated to two or more background nodes using a quorum commit technique (the primary and at least one secondary must confirm that the activity is written to the transaction log before the transaction is deemed successful and returns). In the case of node failure the database automatically fails over to one of the secondary replicas. This causes a transient connection interruption for client applications – one of the key reasons why all SQL Database clients must implement some form of transient connection handling (see below for details on implementing transient connection handling).

The monthly availability SLA is 99.9% up-time, defined as the ability to connect to SQL Database within 30 seconds in a 5 minute interval. The failover events described in the prior paragraph typically occur in less than 30 seconds, reinforcing the need for your application to handle transient connection faults.

SQL Database provides insight into its health and performance through dynamic management views (DMVs); these DMVs contain information about key aspects of the system, such as query performance, database and table size, and so on. Applications are responsible for collecting and analyzing information from key DMVs on a periodic basis, and integrating into the wider telemetry and insight framework.

There are several business continuity (backup, restore) options available for SQL Database. Databases can be copied via the Database Copy functionality or the DAC Import/Export Service. Database Copy provides transactionally consistent results, while a bacpac (through the import/export service) does not. Both of these options run as queue-based services within the data center, and do not currently provide a time-to-completion SLA.

Note that the database copy and import/export service place a significant degree of load on the source database, and can trigger resource contention or throttling events (described in the Shared Resources and Throttling section below). While neither of these approaches provides the same degree of incremental backup supported by SQL Server, a new feature is currently in preview to enable point-in-time restore capability. The point in time restore feature allows users to restore their database back to an arbitrary point in time within the past 2 weeks.

The only currently supported authentication technique is SQL authentication: single-factor username/password logins based on registered users in the database. Active Directory and two-factor authentication capabilities are not yet available. Encrypting the connection to SQL Database is highly recommended, using the built-in encryption support present in ADO.NET, ODBC, and so on. Database-level permissions are consistent with SQL Server. See Managing Databases and Logins in Azure SQL Database for a detailed discussion of setting up Azure SQL Database security.

SQL Database provides a rich set of dynamic management views for observing query performance and database health. However, no automated infrastructure is provided for gathering and analyzing this data (and familiar tools such as direct-attached profiler and OS level performance counters are not available). Approaches to gathering and analysis are described in the Telemetry section of this document.

As described above, SQL Database is a multi-tenant service running on a shared infrastructure. Databases from different tenants share underlying physical nodes, which are built on commodity hardware. Other users of the system can consume key resources (worker threads, transaction log, I/O, and so on) on the same underlying infrastructure. Resource usage is governed and shaped to keep databases within set resource boundaries. When these limits are exceeded, either at a tenant or a physical node level, SQL Database responds by throttling usage or dropping connections. These limits are listed in the table below.


Resource | Maximum value per transaction / session | Max value per physical node | Soft throttling limit | Hard throttling limit
Worker Threads | | | |
Database Size | Configured maximum size up to 150 GB per database | | | 100%; inserts or updates not accepted after reaching limit
Transaction Log Growth | 2 GB per transaction | 500 GB | |
Transaction Log Length | 20% of total log space (100 GB) | 500 GB | |
Lock Count | 1 million per transaction | | |
Blocking System Tasks | 20 seconds | | |
Temp DB Space | 5 GB | | | 5 GB
Memory Usage | | | | 16 MB over 20 seconds
Max concurrent requests | 400 per database | | |

When a transaction limit is reached the system responds by cancelling the transaction. When a database reaches a soft throttling limit, transactions and connections are slowed down or aborted. Reaching a hard throttling limit affects all databases (and users) on the underlying physical node, causing existing operations to be terminated and preventing new operations or connections until the resource falls below the throttling threshold.

Some of these throttling limits result in potentially non-intuitive constraints on the design and performance of an application. For example, limiting the transaction log growth to a maximum of 2 GB per transaction prevents building an index on a large table (where building the index will generate more than 2 GB of transaction log). Techniques for performing such operations are discussed in the Best Practices section of this document.

Handling these types of throttling conditions and transient errors requires careful design and implementation of client code; addressing them requires scaling out the database tier to leverage multiple databases concurrently (scale-out is discussed in the next section).

SQL client application code should:

  • Implement retry code that is aware of the SQL error codes related to throttling, and provides appropriate back-off logic. Without some form of backoff logic in your application you can lock the database into a continuous throttling state by constantly pushing peak load against the database.

  • Log throttling errors, using the retry code to distinguish between transient connection errors, throttling, and hard failures (syntax errors, missing sprocs, and so on). This will assist in tracking and troubleshooting application availability issues.

  • Implement a circuit breaker pattern. When an appropriately chosen retry policy has expired (balancing latency and system response against how often the application will retry), invoke a code path to handle the non-transient error (i.e. trip the breaker). The application code may then:

    • Fallback to another service or approach. If the application failed to insert new data into SQL Database (and the data did not need to be immediately available), the data could instead be serialized into a DataTable (or other XML/JSON format), and written to a file in blob storage. The application could then return a success code to the user (or API call), and insert the data into the database at a later point.

    • Fail silent by returning a null value, if the data or workflow was optional (will not impact end-user experience).

    • Fail fast by returning an error code if no useful/appropriate fallback mechanism is available.
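The retry-with-backoff guidance above can be sketched as follows. This is an illustrative, language-agnostic sketch in Python rather than the .NET implementation the document assumes; the error numbers shown (40501, 40197, 40613) are SQL Database throttling/transient codes, and the `operation` callable stands in for an actual SQL command.

```python
import random
import time

# SQL Database error numbers commonly treated as retryable (throttling and
# transient connection failures); anything else should trip the circuit breaker.
RETRYABLE_ERRORS = {40501, 40197, 40613}

class TransientSqlError(Exception):
    """Stand-in for a SQL client exception carrying a server error number."""
    def __init__(self, number):
        super().__init__(f"SQL error {number}")
        self.number = number

def execute_with_backoff(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run `operation`, retrying throttling/transient errors with exponential
    backoff plus jitter; re-raise anything else (or the last error once the
    retry policy is exhausted) so the caller can trip its breaker and fall back."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientSqlError as err:
            if err.number not in RETRYABLE_ERRORS or attempt == max_attempts - 1:
                raise  # non-transient, or retry policy expired: trip the breaker
            # Backoff with jitter avoids the "continuous throttling" state the
            # text warns about, where clients re-apply peak load in lock-step.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Once the exception escapes `execute_with_backoff`, the caller chooses the fallback, fail-silent, or fail-fast path described above.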

SQL Database can deliver a large number of relatively small scale units (databases) with ease. Implementing highly scalable applications on Azure leveraging SQL Database requires a scale-out approach, composing the resources of multiple databases to meet variable demand. Applications have traditionally targeted the “titanium eggshell” of a single highly resilient scale-up database server; thoughtful design is required to shift them to efficient use of a scale-out database service.

With SQL Database, as with the other core Azure services, scale-out and composition are the keys to additional scale (database size, throughput) and resources (worker threads, and so on). There are two core approaches to implementing partitioning/sharding (and hence scale-out) for SQL Database; these approaches are not mutually exclusive within an application:

  • Horizontal partitioning. In a horizontal partitioning approach, intact tables or data sets are separated out into individual databases. For example, for a multi-tenant application serving different sets of customers, the application may create a database per customer. For a larger single-tenant application, rows of the customer table may be split across multiple databases by ranges of the partitioning key. The partitioning key is typically the tenant identifier (e.g. customer ID). In the diagram below the data set is horizontally partitioned into three different databases, using a hash of the email as a partition value (that is, the partition key is the email, and the partition function uses a hash of the key to map to a destination database).

    Horizontal partitioning.
  • Vertical partitioning. In a vertical partitioning approach a data set is spread across multiple physical tables or databases, based on schema partitioning. For example, customer data and order data may be spread across different physical databases. In the diagram below, the data set is vertically partitioned into two different databases. Core user information (name, email) is stored in DB1, with user profile information (such as the URI of the avatar picture) stored in DB2.

    Vertical partitioning.

Many applications will use a mix of horizontal and vertical partitioning (hybrid partitioning), as well as incorporating additional storage services. For example, in the diagram above the avatar pictures for users were stored as IDs in the database, which the application expands to a full URL. This URL then maps to an image stored in a blob.
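The hash-of-email routing in the horizontal partitioning example above can be sketched as a simple partition function. This is an illustrative Python sketch; the three-shard layout, placeholder connection strings, and the `shard_for` name are assumptions, not the document's implementation.

```python
import hashlib

# Illustrative placeholders for the three databases in the diagram.
SHARD_CONNECTION_STRINGS = [
    "Server=tcp:shard0;Database=db0;",
    "Server=tcp:shard1;Database=db1;",
    "Server=tcp:shard2;Database=db2;",
]

def shard_for(email: str) -> str:
    """Map a partition key (the email) to a destination database by hashing.
    A stable hash (not Python's per-process randomized hash()) keeps routing
    consistent across instances and restarts."""
    digest = hashlib.sha1(email.lower().encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(SHARD_CONNECTION_STRINGS)
    return SHARD_CONNECTION_STRINGS[index]
```

Because the modulo is taken over the shard count, adding shards later changes the mapping; the repartitioning costs discussed below apply.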

When working with a scaled out relational data store the availability calculation is very different. Systems with larger numbers of shards have a higher probability of some individual data segment being offline, with a much lower probability of the entire application being unavailable. Applications need to account for partial availability of backend data stores. With a scale-out data model, data availability is no longer an all-or-nothing condition.

Repartitioning data can prove challenging, especially if the usage model or distribution of data changes over time. Range-based partitioning keys, whether based on a fixed number (using a modulo of the hash of the partitioning values) or distribution of the partitioning values, require rebalancing data between individual shards. Range based partition schemes often leverage binary splits or merges to simplify rebalancing operations.

For example, fixed-range partitioning methods (such as first letter of last name) may start off as a balanced distribution, but can quickly become highly unbalanced as new users arrive (each bringing their own last name, which may or may not be evenly distributed across the alphabet). Keep in mind the potential need over time to adjust the partitioning mechanism, and the cost of rebalancing data.

Lookup based partitioning schemes are more challenging to implement, requiring a high-performance lookup mechanism for every data tenant or partition, but are more amenable to granular rebalancing as they allow a single tenant to be individually rebalanced to a new partition. They also allow additional capacity (new databases, and so on) to be added to the system without requiring copying of data.

Regardless of the mix of sharding approaches, moving to a scale-out of sharded relational databases carries with it certain restrictions requiring a different approach to data management and querying:

  • Traditional “good” SQL data storage and query designs are optimized for storage and consistency by leveraging heavily normalized data models. This approach assumes a globally coherent data space, leveraging cross references and JOINs between tables. With data spread over physically distributed nodes, JOINs and cross references are only feasible within a single shard. SQL Database does not support distributed queries across multiple databases; data merging must be handled at the client tier, along with denormalization and replication of data across shards.

  • Reference data and metadata is typically centralized into reference tables. In a scale-out approach, these reference tables – and any data that cannot be separated by the common partitioning key -- need to be replicated across shards and kept consistent.

  • There is no practical way to deliver scalable, performant distributed transactions across shards; data (and schema updates!) will never be transactionally consistent between shards. Application code needs to assume, and account for, a degree of loose consistency between shards.

  • Application code needs to understand the sharding mechanism (horizontal, vertical, type of partitioning) in order to be able to connect to the appropriate shard(s).

  • Common ORMs (such as Entity Framework) do not natively understand scale-out data models; applications which make extensive use of large ORMs may require significant redesign to be compatible with horizontal sharding. Designs isolating tenants (customer sets) in a vertically sharded approach to a single database generally require less redesign at the data access layer. The downside to the pure-vertical shard model is that each individual shard is limited by the capacity of a single database.

  • Queries which need to touch (read or write) multiple shards need to be implemented with a scatter/gather pattern, where individual queries are executed against the target shards, and the result set aggregated at the client data access layer.

  • Schema updates are not guaranteed to be transactionally consistent, especially when updating a large number of shards. Applications either need to accept planned downtime periods, or account for multiple concurrent versions of deployed schema.

  • Business continuity processes (backup/restore, and so on) need to account for multiple data shards.

Scaling out with SQL Database is done by manually partitioning or sharding data across multiple SQL databases. This scale-out approach provides the opportunity to achieve near linear cost growth with scale. Elastic growth or capacity on demand can grow with incremental costs as needed. Not all applications can support this scale-out model without significant redesign.

The design recommendations and best practices for addressing these challenges are covered in the Best Practices section of this document.

The rest of this document focuses on illustrating best practices for delivering highly scalable applications with Azure and SQL Database, based on real-world experiences and the education they provide. Each best practice discusses the targeted optimization and components, the implementation approach, and the inherent tradeoffs. As with any best practice, these recommendations are highly dependent on context in which they are applied. Evaluate each best practice for fit based on the platform capabilities discussed in the previous section.

These experiences are drawn from a number of customer engagements that do not follow the classic OLTP (online transaction processing) pattern. It’s important to realize that some of these best practices may not be directly applicable to applications requiring strong or strict data consistency; only you know the precise business requirements of your application and its environment.

Each best practice will relate to one or more optimization aspects:

  • Throughput. How to increase the number of operations (transactions, service calls, and so on) through the system and to reduce contention.

  • Latency. How to reduce latency, both on aggregate and individual operations.

  • Density. How to reduce contention points when composing services in both direct contexts (for example, application code to SQL Database) and aggregate contexts (leveraging multiple storage accounts for increased scale).

  • Manageability. Diagnostics, telemetry and insight – how to understand the health and performance of your deployed services at scale.

  • Availability. How to increase overall application availability by reducing the impact of failure points and modes (availability-under-load is covered under throughput/latency/density).

As the baseline unit of scale in Azure, careful design and deployment of hosted services is critical to providing highly scalable, available services.

  • The number of instances and upgrade domains within a hosted service can have a dramatic impact on the time required to deploy, configure, and upgrade the hosted service. Balance the performance and scalability benefits with the increased complexity required to achieve those benefits. Improving scalability and flexibility typically increases the development and management costs of a solution.

  • Avoid single-instance roles; this configuration does not meet the requirements for the Azure SLA. During a node failure or upgrade event, a single-instance role is taken offline. Limit their use to low-priority “maintenance” tasks.

  • Every data center has a finite (although large) capacity, and can serve as a single point of failure under rare circumstances. Services requiring the highest level of scale and availability must implement a multi-data center topology with multiple hosted services. However:

  • When the highest level of availability is not required (see the preceding item), ensure that applications and dependent services are wholly contained within a single data center. In situations where the solution must use multiple data centers, use the following guidelines:

    • Avoid cross-datacenter network calls for live operations (outside of deliberate cross-site synchronization). Long-haul latency between datacenters can be highly variable and produce unexpected or undesirable application performance characteristics.

    • Have only the minimum required functionality that accesses services in another datacenter. Typically, these activities relate to business continuity and data replication.

For large scale distributed applications, access to stateful application data is critical. Overall application throughput and latency is typically bounded by how quickly required data and context can be retrieved, shared, and updated. Distributed cache services, such as Azure Caching and memcached, have evolved in response to this need. Applications should leverage a distributed cache platform. Consider the following guidelines:

  • Leverage a distributed caching platform as a worker role within your hosted service. This close proximity to the clients of the cache reduces latency and throughput barriers presented by load balancer traversal. In-Role Cache on Azure Cache hosts caching on worker roles within your cloud service.

  • Use the distributed caching platform as the primary repository for accessing common application data and objects (for example, user profile and session state), backed by SQL Database or other durable store in a read-through or cache-aside approach.

  • Cache objects have a time-to-live which affects how long they are active in the distributed cache. Applications either explicitly set time-to-live on cached objects or configure a default time-to-live for the cache container. Balance the choice of time-to-live between availability (cache hits) versus memory pressure and staleness of data.

  • Caches present a key->byte[] semantic; be aware of the potential for overlapping writes to create inconsistent data in cache. Distributed caches do not generally provide an API for atomic updates to stored data, as they are not aware of the structure of stored data.

    • For applications requiring strict consistency of concurrent writes, use a distributed caching platform that provides a locking mechanism for updating entities. In the case of Azure Caching, this can be implemented via GetAndLock/PutAndUnlock. Note: this will have a negative impact on throughput.

  • Cache performance is bounded on the application tier by the time required to serialize and deserialize objects. To optimize this process, leverage a relatively symmetrical (same time required to encode/decode data), highly efficient binary serializer such as protobuf.

    • To use custom serialization successfully, design data transfer objects (DTOs) for serialization into cache, use proper annotation for serialization, avoid cyclical dependencies, and utilize unit tests to track efficient serialization.
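The cache-aside/read-through approach described above (distributed cache as the primary repository for common data, backed by a durable store) can be sketched as follows. This is an illustrative Python sketch: the module-level `cache` dict stands in for a distributed cache client such as Azure Caching or memcached, and the loader callable stands in for a SQL Database query.

```python
import time

cache = {}  # stands in for a distributed cache client: key -> (expires_at, value)
DEFAULT_TTL = 300  # seconds; balance hit rate against staleness and memory pressure

def get_with_cache_aside(key, load_from_store, ttl=DEFAULT_TTL, now=time.time):
    """Read-through/cache-aside: return the cached value if present and still
    within its time-to-live; otherwise load from the durable store (e.g. SQL
    Database) and repopulate the cache."""
    entry = cache.get(key)
    if entry is not None and entry[0] > now():
        return entry[1]  # cache hit
    value = load_from_store(key)  # cache miss: go to the durable store
    cache[key] = (now() + ttl, value)
    return value
```

The `ttl` argument is where the availability-versus-staleness tradeoff described above is made explicit per object or per container.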

By default, connections between service tiers (including inbound connections through the load balancer) are subject to a round-robin allocation scheme, with limited connection pinning. The following diagrams illustrate the typical connection mesh that results between tiers and external services (the left side illustrates a typical web-tier only application). While this mesh presents no substantial performance issues for lightweight connection protocols (such as HTTP), certain connections are either expensive to connect/initialize or are a governed (limited) resource. For example, SQL Database connections belong in this category. To optimize usage of these external services and components, it is highly recommended to affinitize resource calls against specific instances.

Connection affinity

In the diagram above, the topology on the right has separate web and worker tiers (roles) within the same hosted service. This topology has also implemented affinity between the web and application layers to pin calls from specific application instances to specific databases. For example, to request data from Database One (DB1), web instances must request the data via application instances One or Two. As the Azure load balancer currently only implements a round-robin technique, delivering affinity in your application does require careful design and implementation.

  • Design the application with separate web and application tiers, delivering partition or resource-aware affinity between the web and application tier.

  • Implement routing logic that transparently routes intra-service calls to a target application instance. Use the knowledge of the partitioning mechanism used by external or downstream resources (such as SQL Database).

Practical implementation of this multitier architecture requires highly efficient service communication between the web and application tiers, using lightweight protocols and efficient serialization.
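The affinity described above (web instances pinning calls to the application instances that own a given database) can be sketched as a routing table. This is an illustrative Python sketch; the instance names and the two-instances-per-shard layout mirror the diagram but are assumptions.

```python
# Which application instances are affinitized to which database shard;
# mirrors the diagram (DB1 is reachable only via application instances 1 and 2).
SHARD_TO_APP_INSTANCES = {
    "DB1": ["app-instance-1", "app-instance-2"],
    "DB2": ["app-instance-3", "app-instance-4"],
}

def route_request(shard: str, request_id: int) -> str:
    """Route an intra-service call from the web tier to an application
    instance affinitized to the target shard, spreading load across the
    owning instances. This is the routing logic the application must supply
    itself, since the Azure load balancer only does round-robin."""
    owners = SHARD_TO_APP_INSTANCES[shard]
    return owners[request_id % len(owners)]
```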

Development techniques for an Azure application are not fundamentally different from development techniques for Windows Server. However, the elastic fabric highlights the need for, and the benefit of, efficient code that most effectively uses computing resources.

  • Assume that all services, network calls, and dependent resources are potentially unreliable and susceptible to transient and ongoing failure modes (for example, how to implement retry logic for SQL Database is covered below in this topic):

    • Implement appropriate retry policies on all service calls (SQL Database, storage, and so on) to handle transient failure conditions and connectivity loss.

    • Implement backoff policies in the retry logic to avoid “convoy” effects (retries stacking up on the service that prolongs outages).

    • Implement rich client-side telemetry for logging error messages and failure events with contextual information (destination service, user/account context, activity, and so on).

  • Do not directly create threads for scheduling work; instead leverage a scheduling and concurrency framework such as the .NET Task Parallel Library. Threads are relatively heavyweight objects and are nontrivial to create and dispose. Schedulers that work against a shared thread pool can more efficiently schedule and execute work. This architecture also provides a higher level semantic for describing continuation and error handling.

  • Optimize data transfer objects (DTOs) for serialization and network transmission. Given the highly distributed nature of Azure applications, scalability is bounded by how efficiently individual components of the system can communicate over the network. Any data passed over the network for communication or storage should implement JSON text serialization or a more efficient binary format with appropriate hints to minimize the amount of metadata transferred over the network (such as shorter field names “on the wire”).

    • If interoperation is important, use an efficient textual format, such as JSON, for interoperability and in-band metadata.

    • If higher throughput is important, such as in service-to-service communication in which you control both ends, consider a highly efficient packed binary format, such as bson or protobuf.

      • Avoid chatty (frequent) data transfer of small objects. Chatty communication between services wastes substantial system resources on overhead tasks and is susceptible to variable latency responses.

      • Tests for serializing and deserializing objects should be a core component of your automated test framework. Functionality testing ensures that data classes are serializable and that there are no cyclical dependencies. Performance testing verifies required latency times and encoding sizes.

  • Where practical, leverage lightweight frameworks for communicating between components and services. Many traditional technologies in the .NET stack provide a rich feature set which might not be aligned with the distributed nature of Azure. Components that provide a high degree of abstraction between intent and execution often carry a high performance cost.

    • Where you do not require protocol interoperation or advanced protocol support, investigate using the ASP.NET Web API instead of WCF for implementing web services.

    • Where you do not require the rich features of the Entity Framework, investigate using a micro-ORM such as Dapper to implement the SQL client layer.

  • Reduce the amount of data delivered out of the data center by enabling HTTP compression in IIS for outbound data.

  • Affinitize connections between tiers to reduce chattiness and context switching of connections.

  • To reduce load on the application, use blob storage to serve larger static content (> 100 kB).

  • To reduce load on the application, use the Content Delivery Network (CDN) via blob storage to serve static content, such as images or CSS.

  • Avoid using SQL Database for session data. Instead, use distributed cache or cookies.
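The guidance above on minimizing in-band metadata (shorter field names "on the wire") can be illustrated with plain JSON. This is an illustrative Python sketch; the DTO fields and their short wire names are assumptions chosen for the example.

```python
import json

# Map descriptive DTO field names to short wire names, shrinking the
# metadata that travels with every serialized message.
WIRE_NAMES = {"customer_identifier": "cid", "email_address": "em", "display_name": "dn"}
FIELD_NAMES = {v: k for k, v in WIRE_NAMES.items()}

def to_wire(dto: dict) -> str:
    """Serialize a DTO using compact wire names and no whitespace padding."""
    return json.dumps({WIRE_NAMES[k]: v for k, v in dto.items()}, separators=(",", ":"))

def from_wire(payload: str) -> dict:
    """Restore the descriptive field names on deserialization."""
    return {FIELD_NAMES[k]: v for k, v in json.loads(payload).items()}
```

Round-trip tests such as the one accompanying this sketch are exactly the serialization checks the text recommends keeping in the automated test framework.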

Azure Storage is the durable data backbone of an Azure application. While providing a highly reliable and scalable experience “out of the box,” large-scale applications require appropriate design and usage guidelines.

  • Leverage multiple storage accounts for greater scalability, either for increased size (> 100 TB) or for more throughput (> 5,000 operations per second). Ensure that your application code can be configured to use multiple storage accounts, with appropriate partitioning functions to route work to the storage accounts. Design the ability for adding additional storage accounts as a configuration change, not a code change.

  • Carefully select partitioning functions for table storage to enable the desired scale in terms of insert and query performance. Look to a time-based partitioning approach for telemetry data, with composite keys based on row data for non-temporal data. Keep partitions in an appropriate range for optimal performance; very small partitions limit the ability to perform batch operations (including querying), while very large partitions are expensive to query (and can bottleneck on high-volume concurrent inserts).

    • The choice of partitioning function also has a foundational impact on query performance; table storage provides efficient lookup by {partition key, row key}, less efficient processing of {partition key, row matches filter} and {partition key matches filter, row key matches filter}, and least efficient of all, queries requiring a global table scan ({row key matches filter}).

    • Partitions can be as small as a single entity; this provides highly optimized performance for pure lookup workloads such as shopping cart management.

  • When possible, batch operations into storage. Table writes should be batched, typically through use of the SaveChanges method in the .NET client API. Insert a series of rows into a table, and then commit the changes in a single batch with the SaveChanges method. Updates to blob storage should also be committed in batch, using the PutBlockList method. As with the table storage API, call PutBlockList against a range of blocks, not individual blocks.

  • Choose short column names for table properties, as the metadata (property names) is stored in-band. The column names also count towards the maximum row size of 1 MB. Excessively long property names are wasteful of system resources.
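The time-based partitioning approach for telemetry suggested above can be sketched as follows. This is an illustrative Python sketch; the hour-bucket granularity and the key formats are assumptions, and the right granularity depends on your insert volume.

```python
from datetime import datetime, timezone

def telemetry_partition_key(event_time: datetime) -> str:
    """Bucket telemetry rows by hour so inserts for a given period land in one
    partition (enabling batched writes) while older partitions remain cheap to
    query by time range. Granularity is a tuning choice: too fine limits
    batching, too coarse bottlenecks on concurrent inserts."""
    return event_time.strftime("%Y%m%d%H")

def telemetry_row_key(event_time: datetime, instance_id: str) -> str:
    # Within a partition, order rows by timestamp then source instance so
    # {partition key, row key} lookups stay efficient.
    return f"{event_time.strftime('%Y%m%d%H%M%S%f')}_{instance_id}"
```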

As covered in the Exploring Azure section, SQL Database provides turn-key relational database as a service capability, enabling scalable access to data storage in a scale-out fashion. Successful use of SQL Database in large-scale applications requires several careful design and implementation choices; key design points and best practices are addressed in this section.

Many applications require a metadata table to store details such as routing, partitioning, and tenant information. Storing this metadata in a single database creates both a single point of failure, as well as a scalability bottleneck. Central metadata stores should be scaled-out through a combination of:

  • Aggressive caching. Information in the configuration database should be aggressively cached into a distributed cache (such as memcached or Azure Caching).

    • Beware of the effect of aggressively attempting to preload the cache from multiple worker threads on application startup, which typically leads to excessive load and database throttling. If your application requires a preloaded cache, delegate the responsibility of loading data to a dedicated worker role (or scheduled task) with a configurable loading rate.

    • If application performance or reliability is dependent on having a certain segment of data available in the cache, your application should refuse incoming requests until the cache has been prepopulated. Until the data is populated, the application should return an appropriate error message or code.

  • Scale-out. Partition the data either vertically (by table) or horizontally (segment a table across multiple shards) to spread the load across multiple databases.
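The guidance above on preloading the cache at a configurable rate (rather than hammering the configuration database from many threads at startup) can be sketched as a dedicated-worker loop. This is an illustrative Python sketch; `load_batch`, the batch size, and the inter-batch delay are assumptions to be tuned per application.

```python
import time

def preload_cache(keys, load_batch, store, batch_size=100,
                  delay_between_batches=1.0, sleep=time.sleep):
    """Dedicated-worker cache preload: fetch reference data in modest batches
    with a configurable pause between them, so startup does not push the
    configuration database into throttling. Returns the number of keys loaded,
    so the application knows when it can start accepting requests."""
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        for key, value in load_batch(batch).items():
            store[key] = value
        if start + batch_size < len(keys):
            sleep(delay_between_batches)  # the configurable loading rate
    return len(keys)
```

Until this returns, the application should reject or error incoming requests that depend on the preloaded segment, as the text recommends.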

The overall scalability of a partitioned SQL Database is bounded by the scale of an individual database (shard) and how efficiently and effectively these shards can be composed together for increased scale:

  • Because the transaction log limits restrict larger transactions, such as rebuilding indexes, individual tables should not exceed roughly 10 GB (note that this practical limit is dependent on the size of the target table indexes, so it might be more or less than 10 GB for your database). For large individual tables, break down the table into individual smaller tables, and use a partitioned view to provide a uniform overlay.

    • Keeping individual table sizes small reduces the impact of a schema change or an index rebuild during a phased upgrade. Changes to multiple smaller tables minimize downtime and latency due to blocking operations.

    • This partitioning complicates management techniques. Operations such as rebuilding index must be executed in an iterative fashion over all component tables.

  • Keep individual databases (shards) reasonably small. Continuity operations such as DB Copy or exports against databases larger than 50 GB can take hours to complete (the service cancels operations that run for more than 24 hours).

In a world of continuous service delivery, managing distributed database updates or schema modifications requires due care and attention for a smooth upgrade path. All of the traditional best practices for control of schema and metadata operations in a production data store are more important than ever. For example, it is a much more complex task to debug and resolve an accidentally dropped stored procedure in 1 of 100 databases.

Because schema updates and data modifications are not transactionally consistent across shards, application updates must be compatible with both old and new schemas during the transition period. This requirement typically means that each release of the application must be at least compatible with the current version and the previous version of the schema.

Shifting to a scale-out collection of databases creates challenges around connection management. Each SQL database connection is a relatively expensive resource, evidenced by the extensive use of connection pooling in client APIs (ADO.NET, ODBC, PHP, and so on). Instead of each application instance maintaining multiple connections to a central SQL Server, each application instance must potentially maintain connections to multiple database servers.

With connections as an expensive and potentially scarce resource, applications must properly manage connections by returning pooled connections in a timely manner. Application code should use automatic connection disposal; in .NET the best practice is to wrap all use of SqlConnection inside a using statement, such as:

using (var conn = new SqlConnection(connStr))
{
    // SQL client calls here
}

As previously described, connections against a SQL Database are subject to transient connection faults. Retry logic must be used on all connections and commands to protect against these transient faults (see below for additional details).

SQL Database only supports TCP connections (not named pipes), and it is highly recommended to encrypt the connection between application code and SQL Database. To prevent unintended connection attempts (such as attempting to use named pipes) applications should format their SQL connection strings as follows:

Server=tcp:{servername},1433;Database={database};User ID={userid}@{database};Password={password};Trusted_Connection=False;Encrypt=True;Connection Timeout=30;

For large-scale application deployments, the number of potential connections can grow multiplicatively between a hosted service deployment and the SQL Database logical servers in a SQL Database cluster (each of which has a single external IP address). Take for example a hosted service with 100 instances, 50 databases, and the default ADO.NET connection pool size of 100 connections.

MaxConnections = DatabaseCount * InstanceCount * MaxConnectionPoolSize
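Plugging the example above (100 instances, 50 databases, a pool size of 100) into this formula shows why the defaults are dangerous at scale:

```python
instance_count = 100            # hosted service instances
database_count = 50             # SQL Database shards
max_connection_pool_size = 100  # default ADO.NET pool size

max_connections = database_count * instance_count * max_connection_pool_size
# 500,000 potential connections -- far beyond the 64k connection limit
# between any two IPv4 addresses behind the Azure load balancer.
print(max_connections)
```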

Refer to the networking topology for a hosted service; each side of the connection (the hosted service and the SQL Database logical servers) lives behind an Azure load balancer. Each Azure load balancer has an upper limit of 64k connections between any two IPv4 addresses. The combination of this network topology with the default number of available connections results in severe network failures for larger applications. The following recommendations reduce the number of required active connections:

  • Deploy larger applications across multiple hosted services.

  • Deploy databases in multiple subscriptions (not only multiple logical servers in the same subscription) to gain more unique IP addresses.

  • Implement multitier applications to affinitize outbound operations to a targeted application instance (see the previous section on hosted services).

  • Remember that connection pools are maintained per unique connection string and correspond to each unique database server, database, and login combination. Use SQL client connection pooling, and explicitly limit the maximum size of the SQL connection pool. The SQL client libraries reuse connections as necessary; the impact of a small connection pool is the potential for increased latency while applications wait for connections to become available.

Shifting to a distributed data model can require changes to the design of database schema, as well as changes to certain types of queries. Applications that require the use of distributed transactions across multiple databases potentially have an inappropriate data model or implementation (for example, trying to enforce global consistency). These designs must be refactored.

Use of a central sequence generation facility should be avoided for any nontrivial aspect of the application, due to availability and scalability constraints. Many applications leverage sequences to provide globally unique identifiers, using a central tracking mechanism to increment the sequence on demand. This architecture creates a global contention point and bottleneck that every component of the system would need to interact with. This bottleneck is especially problematic for potentially disconnected mobile applications.

Instead, applications should leverage functions which can generate globally unique identifiers, such as GUIDs, in a distributed system. By design, GUIDs are not sequential, so they can cause fragmentation when used as a CLUSTERED INDEX in a large table. To reduce the fragmentation impact of GUIDs in a large data model, shard the database, keeping individual shards relatively small. This allows SQL Database to defragment your databases automatically during replica failover.
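Generating identifiers locally, as recommended above, removes the central sequence bottleneck entirely; a minimal sketch (the `new_order_id` name is an assumption for illustration):

```python
import uuid

def new_order_id() -> str:
    """Mint a globally unique identifier with no coordination: any instance
    (or a disconnected mobile client) can generate IDs independently, with no
    central sequence to contend on."""
    return str(uuid.uuid4())
```

As the text notes, these IDs are deliberately non-sequential; keeping shards small limits the clustered-index fragmentation cost of that choice.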

Client application code must account for several aspects of delivering a distributed data model:

  • Partition keys. Partition keys should be part of every data class or model, and potentially decorated with an attribute to allow partition key discovery.

  • Telemetry. The data access layer should automatically log information about every SQL call, including destination, partition, context, latency, and any error codes or retries.

  • Distributed queries. Executing cross-shard queries introduces several new challenges, including routing, partition selection, and the concept of partial success (some individual shards successfully return data, while others do not). The data access layer should provide delegates for executing distributed queries in an asynchronous (parallel) scatter-gather fashion, returning a composite result. Distributed queries also need to account for limiting underlying resources:

    • The maximum degree of parallelism (to avoid excessive thread and connection pressure).

    • The maximum query time (to reduce the overall latency hit from a long-running query or slow shard).
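A scatter-gather delegate along these lines might look like the following sketch (Python for brevity; scatter_gather and query_fn are hypothetical names). It bounds the degree of parallelism, caps per-shard query time, and treats partial success as a normal outcome:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def scatter_gather(shards, query_fn, max_parallelism=4, max_query_seconds=5.0):
    """Run query_fn against every shard and return (results, failures).

    Parallelism is capped to avoid excessive thread and connection pressure;
    the timeout is applied per shard here (a production implementation would
    enforce an overall deadline) to limit the latency hit from a slow shard."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_parallelism) as pool:
        futures = {shard: pool.submit(query_fn, shard) for shard in shards}
        for shard, fut in futures.items():
            try:
                results[shard] = fut.result(timeout=max_query_seconds)
            except FutureTimeout:
                failures[shard] = "timeout"   # slow shard: cap the latency hit
            except Exception as exc:
                failures[shard] = repr(exc)   # partial success is expected
    return results, failures
```

The caller receives the composite result plus the list of shards that did not respond, and can decide whether the partial answer is acceptable for the scenario.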

There are also several ongoing management tasks that must be performed:

  • Reference tables. Without a globally coherent query space, reference data for JOINs in queries must be copied to each individual shard. Maintaining and replicating data to the reference tables in individual shards is required to provide reasonably consistent local reference data.

  • Rebalancing partitions. Individual partitions can become unbalanced, either consuming too many resources and becoming a chokepoint, or being underutilized and wasting resources. In these situations, the partitions must be rebalanced to reallocate resources. The rebalancing mechanism depends heavily on the partitioning strategy and implementation. Rebalancing typically involves:

    • Copying the extant data in a shard to one or more new shards, either through a split/merge mechanism (for range-based partitioning) or an entity-level copy and remap (for lookup-based partitioning).

    • Updating the shard map to point to the new shard, and then compensating for any data written to the old shards during the transition.

  • Data pruning. As an application grows and gathers data, consider periodically pruning older, unused data to increase available headroom and capacity in the primary system. Pruned (deleted) data is not synchronously purged from SQL Database; rather, it is flagged for deletion and cleaned up by a background process. A common way to flag data for deletion is a status column that marks a row as active, inactive, or flagged for deletion. This masks the data from queries for a period of time, so it can easily be moved back into production if users indicate that it is still needed.

    Deleting data can also trigger fragmentation, potentially requiring an index rebuild for efficient query operations. Older data can be archived to:

    • An online store (a secondary SQL Database); this increases headroom in the primary system but does not reduce cost.

    • An offline store, such as a bcp or bacpac file in blob storage; this increases headroom in the primary system and reduces cost.

    • The bit bucket. You can choose to delete data permanently from the primary system to increase headroom.
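The flag-column pattern described above can be sketched as follows (Python for brevity; the status values, field names, and retention window are illustrative, not part of any Azure schema):

```python
from datetime import datetime, timedelta, timezone

ACTIVE, INACTIVE, FLAGGED = "active", "inactive", "flagged_for_deletion"

def flag_stale_rows(rows, older_than_days=365, now=None):
    """Mark rows unused for longer than the retention window as flagged.

    rows: dicts with 'status' and 'last_used' (timezone-aware datetime).
    Flagged rows are masked from queries but can be reactivated on request;
    a background process performs the actual purge later."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=older_than_days)
    for row in rows:
        if row["status"] == ACTIVE and row["last_used"] < cutoff:
            row["status"] = FLAGGED
    return rows

def visible(rows):
    """Queries mask flagged rows instead of deleting them immediately."""
    return [r for r in rows if r["status"] == ACTIVE]
```

Keeping the flag pass separate from the physical delete is what makes the "move back into production" escape hatch cheap: reactivation is a status update, not a restore.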

Several common occurrences in SQL Database can trigger transient connection faults, such as replica failover. Applications must implement appropriate code to handle transient faults and to respond properly to resource exhaustion and throttling:

  • Transient connection fault handling with retry. Data access code should leverage a policy-driven retry mechanism to compensate for transient connection faults. The retry mechanism should detect temporary connection faults, reconnect to the target SQL Database, and reissue the command.

  • Handling throttling with retry and back-off logic. Data access code should leverage a policy-driven retry and back-off mechanism for dealing with throttling conditions. The retry mechanism should detect throttling and gradually back off before reissuing the command (to avoid convoy effects that could prolong the throttling condition).

    • Data access code should also implement the ability to fall back to an alternate data store, such as blob storage. This alternate store provides a durable mechanism for capturing activities, data, and state, avoiding data loss during a prolonged throttling or availability event.
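The retry, back-off, and fallback behavior described above can be sketched as follows (Python for brevity; the retry_count/initial_interval/increment parameters mirror the policy shape used by frameworks like CloudFx, but with_retry itself is a hypothetical helper):

```python
import time

def with_retry(operation, is_transient, fallback=None,
               retry_count=3, initial_interval=5.0, increment=2.0):
    """Policy-driven retry with linearly increasing back-off.

    Non-transient errors are raised immediately; when retries are exhausted,
    the optional fallback (e.g. durably queueing the work in blob storage)
    is invoked instead of losing the data."""
    delay = initial_interval
    for attempt in range(retry_count + 1):
        try:
            return operation()
        except Exception as exc:
            if not is_transient(exc) or attempt == retry_count:
                if fallback is not None:
                    return fallback(exc)
                raise
            time.sleep(delay)    # back off to avoid convoy effects
            delay += increment   # grow the interval on each retry
```

The key design point is that the policy (counts and intervals) is data, separate from the operation, so the same mechanism can be reused for connection faults and throttling with different tunings.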

.NET applications can use SQL Database aware retry and backoff logic frameworks, such as Cloud Application Framework (CloudFx) or the Enterprise Library Transient Fault Handler. These frameworks provide wrappers for common data access classes (such as SqlConnection and SqlCommand), as well as policies that can be directly invoked.

var retryPolicy = RetryPolicy.Create<SqlAzureTransientErrorDetectionStrategy>(
    retryCount: 3,
    initialInterval: TimeSpan.FromSeconds(5),
    increment: TimeSpan.FromSeconds(2));

using (var conn = new ReliableSqlConnection(connStr, retryPolicy))
using (var cmd = conn.CreateCommand())
{
    // Transient connection faults are detected and retried under the
    // policy defined above. (The exact API surface differs slightly
    // between CloudFx and the Enterprise Library.)
    conn.Open();
    cmd.CommandText = "SELECT COUNT(*) FROM [User]";
    var userCount = conn.ExecuteCommand<int>(cmd);
}

The previous code snippet demonstrates using the CloudFx ReliableSqlConnection class to handle transient connection errors when working against SQL Database.

Give specific care and attention to logging API calls to services, such as SQL Database, with potentially complicated failure modes. Attempt to capture key pieces of context and performance information. For example, all SQL Database sessions carry a session identifier, which can be used in support calls to assist with directly isolating the underlying problem. It is highly recommended that all calls, commands, and queries into SQL Database log:

  • Server and database name. With potentially hundreds of databases, the destination server is critical in tracking and isolating problems.

  • As appropriate, the SQL stored procedure or command text. Be careful not to leak sensitive information into the log file – generally avoid logging command text.

  • The end-to-end latency of the call. Wrap the call in a timing delegate by using Stopwatch or another lightweight timer.

  • The result code of the call (success or failure) as well as the number of retries and the failure cause (connection dropped, throttled, and so on).

  • The session ID of the connection, accessible through CONTEXT_INFO() (or the SessionTracingId property if using ReliableSqlConnection). However, do not retrieve the session ID through CONTEXT_INFO() on every client call, because this triggers another round trip to the server.
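These logging recommendations can be folded into a single wrapper around each data access call, sketched here in Python for brevity (logged_sql_call and its parameters are illustrative; command text is omitted by default so sensitive values are not leaked into the log):

```python
import logging
import time

log = logging.getLogger("dataaccess")

def logged_sql_call(server, database, call, command_label="(not logged)",
                    session_id=None, retries=0):
    """Time a database call and log the context needed for support escalation.

    session_id should be captured once per connection (e.g. from
    SessionTracingId), not fetched per call, to avoid extra round trips."""
    start = time.perf_counter()
    outcome, error = "success", None
    try:
        return call()
    except Exception as exc:
        outcome, error = "failure", repr(exc)
        raise
    finally:
        latency_ms = (time.perf_counter() - start) * 1000.0
        log.info("server=%s db=%s cmd=%s session=%s retries=%d "
                 "outcome=%s error=%s latency_ms=%.1f",
                 server, database, command_label, session_id, retries,
                 outcome, error, latency_ms)
```

Because the wrapper logs in a finally block, the destination, latency, and result code are recorded for failed calls as well as successful ones.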

To make more efficient use of SQL Database, queries and data inserts against SQL Database should be batched and executed asynchronously when possible. This not only improves application efficiency, but reduces overall system load on SQL Database (allowing more throughput).

  • Batched insert. For continuous data insert operations, such as registering new users, data should be batched and aligned against target shards. These batches should be written periodically (and asynchronously) to SQL Database, based on triggers such as a target batch size or time window. For batch sizes under 100 rows, table-valued parameters are typically more efficient than a bulk copy operation.

  • Avoid chatty interfaces. Reduce the number of round-trips required to the database to perform a query or set of operations. Chatty interfaces have a high degree of overhead, which increases the load on the system, reducing throughput and efficiency. Try to merge related operations, typically using a stored procedure to reduce round-trips.
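A batched-insert buffer following these triggers might be sketched as below (Python for brevity; BatchInserter and flush_fn are illustrative names). Rows are buffered per target shard and flushed in one round trip when either the size or the time trigger fires:

```python
import time

class BatchInserter:
    """Buffer rows per target shard and flush on batch size or elapsed time."""

    def __init__(self, flush_fn, max_rows=100, max_seconds=5.0):
        self.flush_fn = flush_fn   # hypothetical callback: (shard, rows) -> None
        self.max_rows = max_rows
        self.max_seconds = max_seconds
        self.buffers = {}          # shard -> (first_write_time, rows)

    def add(self, shard, row):
        first, rows = self.buffers.setdefault(shard, (time.monotonic(), []))
        rows.append(row)
        if len(rows) >= self.max_rows or time.monotonic() - first >= self.max_seconds:
            self.flush(shard)

    def flush(self, shard):
        _, rows = self.buffers.pop(shard, (None, []))
        if rows:
            self.flush_fn(shard, rows)  # one round trip per batch, not per row
```

A real implementation would flush asynchronously and on shutdown; the sketch shows only the batching triggers themselves.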

As part of an overall approach to business continuity, data stored in SQL Database should be periodically exported to blob storage. Blob storage supports availability and geo-replication by default.

Implement a scheduled task which periodically exports databases into blob storage. Use a dedicated storage account. This task should run in the same data center as the target databases, not from an on-premises desktop or server.

Schedule the export task for low-activity times to minimize impact on the end-user experience. When exporting multiple databases, limit the degree of parallel export to reduce system impact.
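The bounded-parallelism part of such an export task can be sketched as follows (Python for brevity; export_fn is a hypothetical callback that performs the actual export of one database to blob storage):

```python
from concurrent.futures import ThreadPoolExecutor

def export_all(databases, export_fn, max_parallel_exports=2):
    """Export every database, capping parallelism to limit system impact.

    Intended to run on a schedule during low-activity windows, from a worker
    in the same data center as the target databases."""
    with ThreadPoolExecutor(max_workers=max_parallel_exports) as pool:
        # map preserves input order and propagates the first failure
        return list(pool.map(export_fn, databases))
```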

Insight into the health, performance, and headroom of your SQL Databases is a critical component of overall service delivery. SQL Database provides the raw information required through dynamic management views, but there is not currently a turn-key infrastructure for capturing, analyzing, and reporting key metrics. To deliver this capability for SQL Database, consider the following practices:

  • Implement a periodic task to gather key performance data (system load, and resources used such as worker threads and storage space) into a common repository. For example, consider the tables used by Azure Diagnostics. This task must gather data from all of the databases that belong to your application. This collection typically occurs in a scale-out fashion (gathering data from multiple databases simultaneously).

  • Implement a periodic task to aggregate this information to build key performance indicators of the health and capacity of your deployed databases.

Azure Diagnostics provides a baseline for gathering application and instance-level telemetry. However, providing insight into large-scale applications that run on Azure requires careful configuration and management of data flows. Unlike centralized scale-up applications that can leverage rich Windows diagnostic utilities, diagnostics in a scale-out distributed system must be implemented before the system goes live – they cannot be an afterthought.

Error handling, contextual tracing, and telemetry capture are critical for providing an understanding of error events, root causes, and resolutions.

  • Do not publish live site data and telemetry into the same storage account. Use a dedicated storage account for diagnostics.

  • Create separate channels for chunky (high-volume, high-latency, granular data) and chatty (low-volume, low-latency, high-value data) telemetry.

    • Use standard Azure Diagnostics sources, such as performance counters and traces, for chatty information.

    • Use common logging libraries, such as the Enterprise Application Framework Library, log4net or NLog, to implement bulk logging to local files. Use a custom data source in the diagnostic monitor configuration to copy this information periodically to blob storage.

  • Log all API calls to external services with context, destination, method, timing information (latency) and result (success/failure/retries). Use the chunky logging channel to avoid overwhelming the diagnostics system with telemetry information.

  • Data written into table storage (performance counters, event logs, trace events) is written to a temporal partition that is 60 seconds wide. Attempting to write too much data (too many point sources, too short a collection interval) can overwhelm this partition. Ensure that error spikes do not trigger a high-volume insert attempt into table storage, as this might trigger a throttling event.

    • Choose high-value data to collect from these sources, including key performance counters, critical/error events, or trace records.

    • Choose an appropriate collection interval (5 min – 15 min) to reduce the amount of data that must be transferred and analyzed.

  • Ensure that logging configuration can be modified at run-time without forcing instance resets. Also verify that the configuration is sufficiently granular to enable logging for specific aspects of the system, such as database, cache, or other services.

Azure Diagnostics does not provide data collection for dependent services, such as SQL Database or a distributed cache. To provide a comprehensive view of the application and its performance characteristics, add infrastructure to collect data for dependent services:

  • Gather key performance and utilization data from dependent services, and publish into the WAD repository as performance counter records.

    • Azure Storage via Azure Storage Analytics.

    • SQL Database via dynamic management views.

    • Distributed cache via performance counters or health monitoring APIs.

  • Periodically analyze the raw telemetry data to create aggregations and rollups (as a scheduled task).

© 2015 Microsoft