WSFC Disaster Recovery through Forced Quorum (SQL Server)
Topic Status: Some information in this topic is preview and subject to change in future releases. Preview information describes new features or changes to existing features in Microsoft SQL Server 2016 Community Technology Preview 2 (CTP2).
Quorum failure is usually caused by a systemic disaster, or a persistent communications failure, or a misconfiguration involving several nodes in the WSFC cluster. Manual intervention is required to recovery from a quorum failure.
The Forced Quorum Procedure assumes that a healthy quorum existed before the quorum failure.
The user should be well-informed on the concepts and interactions of Windows Server Failover Clustering, WSFC Quorum Models, SQL Server, and the environment's specific deployment configuration.
For more information, see: Windows Server Failover Clustering (WSFC) with SQL Server, WSFC Quorum Modes and Voting Configuration (SQL Server)
The user must be a domain account that is member of the local Administrators group on each node of the WSFC cluster.
Remember that quorum failure will cause all clustered services, SQL Server instances, and AlwaysOn Availability Groups, in the WSFC cluster to be set offline, because the cluster, as configured, cannot ensure node-level fault tolerance. A quorum failure means that healthy voting nodes in the WSFC cluster no longer satisfy the quorum model. Some nodes may have failed completely, and some may have just shut down the WSFC service and are otherwise healthy, except for the loss of the ability to communicate with a quorum.
To bring the WSFC cluster back online, you must correct the root cause of the quorum failure under the existing configuration, recover the affected databases as needed, and you may want to reconfigure the remaining nodes in the WSFC cluster to reflect the surviving cluster topology.
You can use the forced quorum procedure on a WSFC cluster node to override the safety controls that took the cluster offline. This effectively tells the cluster to suspend the quorum voting checks, and lets you bring the WSFC cluster resources and SQL Server back online on any of the nodes in the cluster.
This type of disaster recovery process should include the following steps:
To Recover from Quorum Failure:
Determine the scope of the failure. Identify which availability groups or SQL Server instances are non-responsive, which cluster nodes are online and available for post-disaster use, and examine the Windows event logs and the SQL Server system logs. Where practical, you should preserve forensic data and system logs for later analysis.
On a responsive instance of SQL Server 2016, you can obtain information about the health of availability groups that possess an availability replica on the local server instance by querying the sys.dm_hadr_availability_group_states dynamic management view (DMV).
Start the WSFC cluster by using forced quorum on a single node. Identify a node with a minimal number of component failures, other than that the WSFC cluster service was shut down. Verify that this node can communicate with a majority of the other nodes.
On this node, manually force the cluster to come online using the forced quorum procedure. To minimize potential data loss, select a node that was last hosting an availability group primary replica.
For more information, see: Force a WSFC Cluster to Start Without a Quorum
The forced quorum setting has a cluster-wide affect to block quorum checks until the logical WSFC cluster achieves a majority of votes and automatically transitions to a regular quorum mode of operation.
Start the WSFC service normally on each otherwise healthy node, one at a time. You do not have to specify the forced quorum option when you start the cluster service on the other nodes.
As the WSFC service on each node comes back online, it negotiates with the other healthy nodes to synchronize the new cluster configuration state. Remember to do this one node at a time to prevent potential race conditions in resolving the last known state of the cluster.
Ensure that each node that you start can communicate with the other newly online nodes. Consider disabling the WSFC service on the other nodes. Otherwise, you run the risk of creating more than one quorum node set; that is a split-brain scenario. If your findings in step 1 were accurate, this should not occur.
Apply new quorum mode and node vote configuration. If forcing quorum successfully restarted all the nodes in the cluster and the root cause of the quorum failure has been corrected, changes to the original quorum mode and node vote configuration are unnecessary.
Otherwise, you should evaluate the newly recovered cluster node and availability replica topology, and change the quorum mode and vote assignments for each node as appropriate. Un-recovered nodes should be set offline or have their node votes set to zero.
At this point, the nodes and SQL Server instances in the cluster may appear to be restored back to regular operation. However, a healthy quorum may still not exist. Using the Failover Cluster Manager, or the AlwaysOn Dashboard within SQL Server Management Studio, or the appropriate DMVs, verify that a quorum has been restored.
Recover availability group database replicas as needed. Non-availability group databases should recover and come back online on their own as part of the regular SQL Server startup process.
You can minimize potential data loss and recovery time for the availability group replicas by bringing them back online in this sequence: primary replica, synchronous secondary replicas, asynchronous secondary replicas.
Repair or replace failed components and re-validate cluster. Now that you have recovered from the initial disaster and quorum failure, you should repair or replace the failed nodes and adjust related WSFC and AlwaysOn configurations accordingly. This can include dropping availability group replicas, evicting nodes from the cluster, or flattening and re-installing software on a node.
You must repair or remove all failed availability replicas. SQL Server will not truncate the transaction log past the last known point of the farthest behind availability replica. If a failed replica is not repaired or removed from the availability group, the transaction logs will grow and you will run the risk of running out of transaction log space on the other replicas.
If you run the WSFC Validate a Configuration Wizard when an availability group listener exists on the WSFC cluster, the wizard generates the following incorrect warning message:
"The RegisterAllProviderIP property for network name 'Name:<network_name>' is set to 1 For the current cluster configuration this value should be set to 0."
Please ignore this message.
Repeat step 4 as needed. The goal is to re-establish the appropriate level of fault tolerance and high availability for healthy operations.
Conduct RPO/RTO analysis. You should analyze SQL Server system logs, database timestamps, and Windows event logs to determine root cause of the failure, and to document actual recovery point and recovery time experiences.