
Minimal Disruption for Azure SQL Database During Application Rolling Upgrades

Authors: Rama Ramani

Contributors: Sanjay Mishra, Mark Simms, Sasha Nosov, Paul Stirpe (Database Architect, Finomial)

Reviewers: Silvano Coriani, Aaron Lower, Shaun Tinline-Jones, Suren Machiraju, Christian Martinez, Shep Shepherd, Jason Roth, Karthika Raman, Bill Wilder (Principal Azure Consultant, Development Partners Software Corporation), Neil Mackenzie (Azure Lead, Satory Global), Michael Collier (Principal Cloud Architect, Aditi Technologies)

Problem Definition

The term ‘application’ is broad and in this context refers to both app components (front ends, services) and data components. An application rolling upgrade refers to rolling out a new version of the application, with changes in app and data components, applied to the live site. To decompose the problem, consider just the database layer and look at a few approaches for addressing it today:

  1. Performing an online upgrade

  2. Building out another cluster, setting it up with continuous replication, taking a small downtime (on the order of minutes), and switching the connection string

  3. Taking a “significant” downtime (a few hours) and performing the upgrade

  4. Leveraging database replicas

Well, if the changes are really simple and can be done online, that is indeed the best option. In certain cases, where taking a “small” downtime is accommodated by your SLA, that approach may also work. If a set of secondary databases (in active-passive mode) is deployed, the upgrade is still not straightforward. For instance, while the primary database is taken offline and upgraded, one of the replicas is made active. What happens to the new active transactions in the replica that need to be synchronized with the upgraded primary database?

Now consider the scenario where the changes are significant and taking a downtime is not desired: how can one approach the upgrade? It is indeed a hard problem to solve. Further, Azure SQL Database is a platform-as-a-service offering, so not all functionality available in SQL Server running in Virtual Machines is available. So, whichever strategy you choose, it will potentially cause some kind of disruption. What if the site is turned to “read-only” while the upgrade is being performed? Is this a disruption? Yes, but limited functionality allowing only reads is a better option than downtime.

Note
The methodology shown below requires that the solution be in “read-only” mode while the rolling upgrade is performed.

The objective of this white paper is to discuss a set of best practices to minimize such disruptions and successfully roll out the new application version. The paper discusses the approaches across a set of common deployment topologies and summarizes the benefits and tradeoffs of each approach. The topology with the minimal disruption (with possibly higher costs) is presented first, followed by other topologies with more trade-offs.

In order to determine the right approach for your scenario, look at the deployment topology, the RTO for each approach, the effect on the Factors for consideration, and finally the impact on your SLA to make the right decision.

Though the best practices are broadly applicable, this white paper applies these concepts to Azure SQL Database. The white paper does not discuss how the changes must be applied, since that is custom to each application.

In subsequent articles, this methodology will be extended to other components.

Constraints and Assumptions

Based on some of the real life customer challenges, this paper has the following operating constraints and assumptions:

  1. New version changes are significant and cannot be made while the app and data components are online with full functionality. The application will be in ‘read-only’ mode during the upgrade process. For the cloud service, this change will be made via a service configuration flag setting, and the app components’ runtime code will check this flag setting while accessing the data components. Write operations to Azure SQL DB will be blocked. The database will be made “read-only” using T-SQL:

    USE [master]
    GO
    ALTER DATABASE [TESTDB] SET READ_ONLY 
    GO
    
    After the successful upgrade, the database will be made “read write” using T-SQL:

    USE [master]
    GO
    ALTER DATABASE [TESTDB] SET READ_WRITE 
    GO
    
  2. There is also an assumption that there is application-wide telemetry tracking write operations, which is used by a monitoring infrastructure that can accurately detect that the application is in “read-only” mode.

  3. All database access from the application is via external connection strings. In the approaches below that use staging deployments, the database connection strings are modified appropriately.

  4. The multi-data-center deployment is set up with an Active Geo-Replication strategy for Azure SQL Database.

Note
Please refer to this FAQ to understand the Premium SKU feature and quota restrictions.
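
The flag check described in constraint 1 can be sketched as a guard in the data-access layer. A minimal Python illustration, where the `ReadOnlyMode` setting name and the `DataAccessLayer` class are hypothetical placeholders, not part of any Azure SDK:

```python
class ReadOnlyModeError(Exception):
    """Raised when a write is attempted while the app is in read-only mode."""

class DataAccessLayer:
    def __init__(self, config):
        # config stands in for the cloud service configuration settings
        self.config = config

    def execute_write(self, statement):
        # Check the service configuration flag before every write,
        # mirroring the READ_ONLY state set on the database itself.
        if self.config.get("ReadOnlyMode", "false").lower() == "true":
            raise ReadOnlyModeError("Writes are blocked during the rolling upgrade")
        return self._send(statement)

    def _send(self, statement):
        # Placeholder for the real database call.
        return f"executed: {statement}"

dal = DataAccessLayer({"ReadOnlyMode": "true"})
try:
    dal.execute_write("INSERT INTO Orders ...")
except ReadOnlyModeError:
    print("blocked")  # -> blocked
```

Reads continue to flow through unguarded paths, so the site keeps its “read-only” functionality while the flag is set.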

Factors for consideration

  1. Upgrade Recovery Time Objective (RTO): Duration taken by the upgrade process to re-establish the application availability with full feature set (read write mode).

  2. Availability during the upgrade process.

  3. Ability to roll back in case of any errors during upgrade.

  4. Total dollar cost needed. This includes specific SKU versions due to feature dependency (such as SQL premium SKU for Active Geo-Replication) and incremental costs for temporary deployments used by the upgrade process. The temporary deployments will be deleted once the upgrade is complete.

Deployment Topology: Multiple Regions

In the next set of sections, we will refer to deployments in two regions, referred to as ‘Site A’ and ‘Site B’. For example, Site A may be deployed in the US-West region while Site B may be in the US-East region.

Warning
Active Geo-Replication is only supported for Premium Databases. Both the primary and the active replica must be Premium databases. The active replica must be configured to have the same or larger performance level as the primary. Please look at http://msdn.microsoft.com/en-us/library/azure/dn741339.aspx for more details.

Upgrade of Site A and Site B together, with DR leveraging the existing Active Geo-Replication relationship, using the ‘Staging’ slot for the cloud service and additional secondaries

In this topology, Site A serves as the primary while Site B serves as the DR location. Azure Traffic Manager (WATM) is configured with the Failover policy and includes the cloud service endpoints from both locations. In order to maintain DR during the upgrade process, both sites have a temporary cloud service deployment in the ‘Staging slot’ with their own database secondaries. Then, on both sites, a VIP swap is done to put the temporary deployments in the production slot. The existing WATM policy redirects users to the temporary deployment, and the system is then set up in read-only mode. Now the staging slots of Site A and then Site B can be upgraded to the new application and database versions. After QA signoff, the latest versions can be pushed out to the production slots via VIP swap. Then the temporary deployments can be deleted from both sites. Site A continues to be the primary while Site B is the DR location.

This topology leverages the Active Geo-Replication relationship, creates additional copies for protection, and applies the upgrade process, with the primary motivation of minimizing any disruption. This approach ensures that the overall solution has availability while the upgrade is in transition and if it needs to be re-applied.

Figure 1

Upgrade Steps

  1. Using Active Geo Replication, create 2 additional replicas for the database, one in Site A and another one in Site B.

  2. The application can continue to be in “read write” mode with full functionality. Monitor the secondaries to see if they have been seeded, and take note of the lag. Pick a suitable point to begin the upgrade.

    Figure 2
  3. On both sites, create the temporary application deployment in the ‘Staging slot’. Point the staging slot connection strings to the database secondaries created in step 1 above. Both the Staging and Production slots have the “current” app version with its own database on both sites.

    Figure 3
  4. Note that until now, the application has been functioning without any disruptions. Setting up the secondaries was to create backup database copies without degrading application functionality.

  5. Now, change the application deployed in Site A to “read-only” mode. Monitor your custom application telemetry to ensure the read-only access is effective. All the secondaries are being synchronized; since no new writes are happening in the system, this should happen reasonably quickly.

  6. Initiate planned termination of the replication relationship to the newly established secondaries. NOTE: this is only for the newly established secondaries; the initial relationship between the primary and secondary must be maintained.

    Figure 4
  7. Do a VIP swap in both site locations. The temporary deployment is now running in the production slot of both sites with its copy of the database. In case of a DR event, users would be redirected to the Site B cloud service, which is accessing its own DB.

    Figure 5
    Note
    Across the two sites, the databases used by the production slots don’t need Active Geo-Replication between themselves since the system is in “read-only” mode.

  8. Note that the databases used by the application in the Staging slots have the existing Active Geo-Replication relationship. Maintain this Active Geo-Replication relationship and deploy the new app version in Site A and Site B.

    Figure 6
  9. Deploy the new app and DB version in Site A. All changes are in the Staging slot and the DB used by it. Due to the existing Active Geo-Replication relationship, the database in Site B is automatically upgraded.

  10. Deploy the new app in Site B. All changes are in the Staging slot. Since the database has been automatically upgraded, there are no additional steps.

    Figure 7
  11. Do a VIP swap to make the new version of the application in the “Production” slot on both Site A & Site B.

  12. Remove the “read-only” restriction on Site A

  13. Delete the artifacts from the staging slot (cloud services, DB copy)

    Figure 8
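
Step 5’s monitoring can be sketched as a quiesce check: poll the write-operation telemetry until several consecutive samples show zero writes. A simulated version, where the sample list stands in for real telemetry queries (the function and parameter names are illustrative, not an Azure API):

```python
def wait_until_quiesced(write_samples, stable_intervals=3):
    """Return the number of polls taken until no writes were observed for
    `stable_intervals` consecutive samples. A real implementation would
    query the telemetry store on a timer instead of iterating a list."""
    quiet = 0
    for polls, writes in enumerate(write_samples, start=1):
        quiet = quiet + 1 if writes == 0 else 0
        if quiet >= stable_intervals:
            return polls
    raise TimeoutError("writes never quiesced")

# In-flight writes tail off after the read-only flag is flipped.
print(wait_until_quiesced([12, 4, 1, 0, 0, 0]))  # -> 6
```

Requiring several consecutive zero-write samples guards against declaring the system quiesced during a momentary lull while in-flight writes are still draining.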

Summary

  • Active Geo Replication was used to create the temporary database on both Sites with full functionality.

  • The Upgrade RTO is defined by the time to ensure the system is in “read-only” mode, plus the time to initiate the planned termination, plus the time for the VIP swap on both sites, plus the time to apply the upgrade script, plus the QA signoff on the new version, plus the time to do the VIP swap again, and finally the time to make the site read-write.

    RTO = Time to turn on Read Only Mode + Time for planned termination + VIP swap duration + Upgrade Duration + QA SignOff + VIP swap duration + Time to turn on read-write

  • Since the Active Geo-Replication relationship is maintained, database changes are automatically applied on both sites.

  • Geo-redundancy is maintained for the system at all points.

  • ROLLBACK: Since the staging slot has its own copy of the database, it is possible to easily roll back in case of any errors during the upgrade.

    • If, for some reason, the upgrade failed on the primary or the primary database was left in an inoperable state, there are at least two other secondary databases which are read-only copies. The primary can be recreated from them, turned to read-write mode, and the upgrade begun again.

  • The incremental costs are the creation of the ‘staging’ deployment and creation of additional Active database secondaries on both Sites.
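
The RTO formula above is simply the sum of the stage durations. A sketch with purely illustrative values in minutes (none of these numbers come from the paper):

```python
# Durations in minutes for each stage of the RTO formula; the values
# are made up for illustration and will differ per deployment.
stages = {
    "turn_on_read_only": 5,
    "planned_termination": 10,
    "vip_swap_1": 2,
    "upgrade": 45,
    "qa_signoff": 30,
    "vip_swap_2": 2,
    "turn_on_read_write": 1,
}

rto_minutes = sum(stages.values())
print(rto_minutes)  # -> 95
```

Measuring each stage separately makes it obvious which step (here, the upgrade script and QA signoff) dominates the RTO and is worth optimizing first.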

Upgrade of Site A and Site B in isolation, with no DR during the upgrade process

This topology has further trade-offs compared to the previous approach. In this case, it leverages the DR location for application and database upgrades, with no protection while this is done. The primary motivation is to lower COGS. This has a serious implication: the overall DR strategy is compromised, and this needs to be incorporated into the SLA when applying this topology. In case the primary site goes down while the DR site is in upgrade transition, there will be downtime.

In this topology, initially Site A serves as the primary while Site B serves as the DR location. Azure Traffic Manager (WATM) is configured with the Failover policy and includes the cloud service endpoints from both locations. The upgrade will first be applied to Site B (the DR location). On successful completion of the upgrade process, the roles of the sites are reversed. Site B will become the primary serving the new application version while Site A becomes the DR site.

Figure 9

Upgrade Steps

  1. Ensure the application deployed in Site A is changed to “read-only” mode. Monitor your custom application telemetry to ensure the read-only access is effective: any in-flight writes have completed and no new writes are happening.

  2. Remove ‘Site B’ cloud service from WATM policy

  3. Initiate planned termination of the Active Geo-Replication relationship. Planned termination will ensure both databases are identical after it is completed.

  4. With the WATM policy change and the Active Geo-Replication relationship removed, Site B is completely isolated. In Site B, now set up the new app and DB versions. Test the new application and database versions.

    Figure 10
  5. Then, modify the WATM policy to add Site B and remove Site A cloud service. Wait until the DNS TTL setting has expired and all traffic flows to Site B. Users are now exposed to the new application and database versions. Monitor to ensure that all traffic flows to Site B.

  6. Upgrade the Site A cloud service to the new application version. The old database version may be dropped after backing it up for historical purposes, per your IT policy.

  7. Set up a new Active Geo-Replication relationship from Site B to Site A. In Site A, this new database secondary will be used by the new application version set up in Step 6. In Site A, monitor until the secondary has been seeded and synchronized with the primary database in Site B.

  8. Modify the WATM configuration (which has been set up for failover) to add the Site A hosted service as the failover site.

  9. Remove the “read-only” restriction on Site B

    Figure 11
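
Step 5 waits for the DNS TTL to expire before all traffic reaches Site B. A crude model of why that wait matters, assuming clients re-resolve only once their cached DNS entry expires (illustrative only, not a WATM API):

```python
def fraction_on_new_site(elapsed_s, ttl_s):
    """Crude model: client DNS caches expire uniformly over one TTL, so
    the share of traffic following the new WATM policy grows linearly
    until one full TTL has elapsed, after which everyone has re-resolved."""
    return min(elapsed_s / ttl_s, 1.0)

ttl = 300  # example DNS TTL in seconds; actual value is per WATM profile
for elapsed in (0, 150, 300, 600):
    print(elapsed, fraction_on_new_site(elapsed, ttl))
```

The takeaway is that traffic does not switch instantly at the policy change; monitoring should confirm the old site has gone quiet rather than assuming the cutover happened at TTL expiry.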

Summary

  • The Upgrade RTO depends on the time to ensure the system is in “read-only” mode, the time to modify the Azure Traffic Manager policy to remove Site B (since there will be no DR during the upgrade), the time to complete the planned termination, the upgrade time for Site B plus the QA signoff time of the new application and database versions, the time to adjust the Azure Traffic Manager policy to remove Site A and add Site B (now that Site B is upgraded, it is time to prepare Site A), the time to upgrade Site A plus QA signoff, the time to set up the new Active Geo-Replication relationship, the time taken by Azure Traffic Manager to add back Site A, and finally the time to turn Site B to read-write.

    RTO = Time to turn on Read Only Mode + Time to adjust the Azure Traffic Manager policy to remove Site B + Time for planned termination + Time to upgrade Site B + QA SignOff + Time to adjust the Azure Traffic Manager policy to remove Site A and add Site B + Time to upgrade Site A + QA signoff + Time to setup a new Active Geo-Replication relationship + Time to adjust the Azure Traffic Manager policy to add Site A + Time to turn on read-write in Site B

  • As per the design choice to leverage the DR site for upgrade, there is no DR during the upgrade process. In case the primary site goes down while the DR site is in upgrade transition, there will be down time.

  • ROLLBACK: For any rollback operation, since the DR site is directly modified, the only way is to re-create the deployment artifacts in the DR site. If Site A goes down or is lost while Site B is being upgraded and the upgrade fails, then the only way to recover is to restore the database on Site B using point in time restore, which will cause downtime.

  • There are minimal to no incremental costs. This process leverages the existing deployments, and the incremental costs for the upgrade are minimal (e.g., bandwidth).

Deployment Topology: Single Region

Upgrade using ‘Staging’ deployment slot of cloud service and Active Geo Replication

In this topology, the application has a cloud service deployed in the ‘Production slot’ using an Azure SQL DB. Before doing an application upgrade, leveraging Active Geo Replication, a secondary will be created and used by the “Staging slot” deployment.

Note
This implies that the Azure SQL Database is deployed in Premium SKU.

Once the database secondary has been seeded and caught up, the application will be made “read-only” and the ‘Staging’ deployment will be created with its own copy of the database. The Staging deployment will be upgraded to the new application version. After QA signoff of the new version, this will be made the production slot via VIP swap. All artifacts with the previous version will be deleted (Staging deployment of cloud service, old Azure SQL DB, old Azure Storage tables, blobs).

Figure 12

Upgrade Steps

  1. Leveraging Active Geo Replication, create an additional replica for the database in Site A.

  2. The application can continue to be in “read write” mode with full functionality. Monitor the secondary to see if it has been seeded and has caught up.

  3. Deploy the application to the ‘Staging Slot’ with the service configuration settings pointing to the copy of the Azure SQL Database created above.

  4. Now, prepare the application to be in “degraded” read-only functionality. Ensure application deployed is changed to “read-only” mode. Monitor your custom application telemetry to ensure the read-only access is effective.

  5. Initiate planned termination of the Active Geo-Replication relationship. Planned termination will ensure that both databases are identical after it is completed.

  6. Perform application upgrades in the Staging slot and Azure SQL Database upgrades used by the application in the ‘Staging slot’.

  7. On successful QA sign off, change the service configuration to remove “read-only” access. Validate that and then do a VIP swap to promote the staging instance (which has new application and database changes) to production.
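
The VIP swap in step 7 can be pictured as the production and staging deployments exchanging endpoints. A minimal sketch, with illustrative slot contents (this models the effect of the swap, not the Azure management API):

```python
def vip_swap(slots):
    """Model a VIP swap: the production and staging deployments exchange
    endpoints, so the upgraded staging build starts serving live traffic
    while the old production build moves into the staging slot."""
    slots["production"], slots["staging"] = slots["staging"], slots["production"]
    return slots

slots = {"production": "app-v1 + db-v1", "staging": "app-v2 + db-v2"}
print(vip_swap(slots)["production"])  # -> app-v2 + db-v2
```

Because the swap only exchanges endpoints, it is near-instant and reversible: running it again restores the previous production build, which is what makes the rollback story in this topology cheap.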

Summary

  • Active Geo Replication was used to create the temporary database with full functionality.

  • The Upgrade RTO depends on the time to ensure the system is in “read-only” mode, plus the time taken for planned termination of Active Geo Replication, upgrade time plus QA signoff time, turning on read-write and finally to do a VIP swap to make the new changes visible.

    RTO = Time to turn on Read Only Mode + Time for planned termination + Time to upgrade + QA SignOff + Time to turn on read-write + Time to do a VIP swap

  • Availability within the same region is maintained during the upgrade by keeping the current application version in the production slot with its own Azure SQL DB copy. Application updates are applied to the ‘Staging’ slot and the database updates are applied to the database copy.

  • ROLLBACK: In case of any deployment or user errors, the ‘Staging’ artifacts may be deleted and re-created. Since this is not exposed to the end users, this can be repeated which provides the ability to rollback.

  • The incremental costs are the creation of the ‘staging’ deployment and data component copies.

Upgrade using ‘Staging’ deployment slot of cloud service and a copy of the database

This topology has more trade-offs compared to the previous one. It is a valid option for non-premium SKU databases. The process does not leverage Active Geo-Replication and instead uses DBCOPY.

Figure 13

Upgrade Steps

  1. Deploy the new application to the ‘Staging Slot’

  2. Ensure application deployed is changed to “read-only” mode. Monitor your custom application telemetry to ensure the read-only access is effective.

  3. Create a copy of the Azure SQL Database (using the database copy command).

  4. Modify the service configuration settings of the application in “Staging” slot, and point to copy of Azure SQL Database created above.

  5. Perform application upgrades in the Staging slot and Azure SQL Database upgrades used by the application in the ‘Staging slot’.

  6. On successful QA sign off, change the service configuration to remove “read-only” access. Validate that and then do a VIP swap to promote the staging instance (which has new application and database changes) to production.
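
Step 3’s database copy is asynchronous, so the workflow has to poll until the copy completes before repointing the staging slot. A simulated sketch of that polling loop; in a real implementation the progress values would come from querying the sys.dm_database_copies view rather than a list:

```python
def wait_for_copy(progress_samples):
    """Return the number of polls taken until the database copy reports
    100 percent complete. `progress_samples` simulates successive reads
    of a percent_complete value from the copy-status view."""
    for polls, pct in enumerate(progress_samples, start=1):
        if pct >= 100:
            return polls
    raise TimeoutError("database copy did not finish")

print(wait_for_copy([10, 45, 80, 100]))  # -> 4
```

Since DBCOPY duration grows with the amount of data in the database, this polling phase is the variable term in the RTO formula below and should be measured on a production-sized copy before the real upgrade window.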

Summary

  • DBCOPY is used to create the database copy, and this approach is a valid option for non-premium SKU databases.

  • The Upgrade RTO depends on the duration to turn the system to “read-only” mode, DBCOPY duration which depends on the amount of data in Azure SQL DB, service configuration change, plus the upgrade time, plus the QA signoff time, time to turn the application to read-write and finally to complete the VIP swap.

    RTO = Time to turn on Read Only Mode + Time for DBCOPY to complete + Time for service configuration change + Time to upgrade + QA SignOff + Time to turn on read-write + Time to do a VIP swap

  • Availability within the same region is maintained during the upgrade by keeping the current application version in the production slot with its own Azure SQL DB copy. Application updates are applied to the ‘Staging’ slot and the database updates are applied to the database copy.

  • ROLLBACK: In case of any deployment or user errors, the ‘Staging’ artifacts may be deleted and re-created. Since this is not exposed to the end users, this can be repeated which provides the ability to rollback.

  • The incremental costs are the creation of the ‘staging’ deployment and data component copies
