AlwaysOn Availability Groups Troubleshooting and Monitoring Guide
This guide helps you get started with troubleshooting common issues in AlwaysOn Availability Groups and with monitoring your availability groups. It provides original content and also serves as a landing page for useful information that is already published elsewhere.
While this guide cannot fully discuss all the issues that can occur on the large surface area covered by AlwaysOn Availability Groups, it can point you in the right direction in your root-cause analysis and resolution of the issues. As AlwaysOn Availability Groups is an integrated technology, many of the problems you encounter are only symptoms of other issues in your database system. Some issues are caused by settings within an availability group, such as an availability database being suspended. Other issues can be isolated to other aspects of SQL Server, such as SQL Server settings, database file deployments, and systemic performance issues unrelated to the availability group, replica, or database. Still other problems exist outside of SQL Server, such as network I/O, TCP/IP, Active Directory, and Windows Server Failover Clustering (WSFC). Often, problems that surface in an availability group, replica, or database require you to troubleshoot multiple technologies before you can identify the root cause.
The table below contains links to common troubleshooting scenarios for AlwaysOn Availability Groups. The scenarios are categorized by type, such as configuration, client connectivity, failover, and performance.
Provides information to help you troubleshoot typical problems with configuring server instances for AlwaysOn Availability Groups. Typical configuration problems include AlwaysOn Availability Groups being disabled, incorrectly configured accounts, a missing database mirroring endpoint, an inaccessible endpoint (SQL Server Error 1418), missing network access, and a failed join database command (SQL Server Error 35250).
When you create an AlwaysOn availability group by using the New Availability Group Wizard in Microsoft SQL Server 2012, you receive a warning message that resembles the following: “The current WSFC cluster quorum vote configuration is not recommended for this availability group.”
You encounter errors when trying to create an availability group listener.
An add-file operation caused the secondary database to be suspended and be in the NOT SYNCHRONIZING state.
You encounter error 41009 when trying to create multiple availability groups.
After you configure the availability group listener, you are unable to ping the listener or connect to it from an application.
An automatic failover did not complete successfully.
After an automatic failover or a planned manual failover without data loss, the failover time exceeds your RTO. Or, when you estimate the failover time of a synchronous-commit secondary replica (such as an automatic failover partner), you find that it exceeds your RTO.
After you perform a forced manual failover, your data loss is more than your RPO. Or, when you calculate the potential data loss of an asynchronous-commit secondary replica, you find that it exceeds your RPO.
The client application completes an update on the primary replica successfully, but querying the secondary replica shows that the change is not reflected.
When you configure or run AlwaysOn Availability Groups, different tools can help you diagnose different types of issues. The table below provides links to useful information on these tools.
Reports an at-a-glance view of the health of your availability group in a user-friendly interface.
Used by the AlwaysOn Dashboard.
Logs state transition events for availability groups, replicas, and databases, statuses of other AlwaysOn components, and AlwaysOn errors.
Logs cluster events, including state transitions of the availability group resource, as well as events and errors from SQL Server resource DLL.
Logs SQL Server health diagnostics as reported to the WSFC cluster (SQL Server resource DLL) by sp_server_diagnostics (Transact-SQL).
Reports information on the availability groups such as configuration, health status, and performance metrics.
Provides detailed diagnostics of the availability groups and is useful for root-cause analysis.
Provides wait statistics specific to availability groups and is useful for performance tuning.
AlwaysOn Performance Counters
Monitor AlwaysOn Availability Groups activity, are reflected in System Monitor, and are useful for performance tuning. For more information, see SQL Server, Availability Replica and SQL Server, Database Replica.
Record alerts within the SQL Server system for internal diagnostics, and can be used to debug issues related to the availability groups.
The ideal time to troubleshoot an availability group is before a problem necessitates a failover, whether automatic or manual. This can be done by monitoring the availability group’s performance metrics and sending alerts when the availability replicas are performing outside the bounds of your service-level agreement (SLA). For example, if a synchronous secondary replica has performance issues that cause the estimated failover time to increase, you do not want to wait until an automatic failover occurs and you find out that the failover time exceeds your recovery time objective.
As AlwaysOn Availability Groups is a high availability and disaster recovery solution, the most important performance metrics to monitor are the estimated failover time, which affects your recovery time objective (RTO), and the potential data loss in a disaster, which affects your recovery point objective (RPO). You can gather these metrics from the data that SQL Server exposes at any given time, so you can be alerted of a problem in the HADR capabilities of your system before the actual failure events occur. Therefore, it is important to familiarize yourself with the data synchronization process of AlwaysOn Availability Groups and gather the metrics accordingly.
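As a rough illustration of how such metrics could be turned into alerts, the following Python sketch estimates failover time and potential data loss from a snapshot of queue sizes and rates, then compares them with RTO and RPO thresholds. It is a simplified sketch, not a definitive implementation. The `redo_queue_size`, `redo_rate`, and `log_send_queue_size` inputs correspond to columns of `sys.dm_hadr_database_replica_states`; the log generation rate, the fixed overhead term, and all class and function names are assumptions made for this example, and the values are passed in by hand rather than read from a live instance.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SecondaryDbSnapshot:
    """One secondary database at a point in time (all sizes in KB).

    The first three fields mirror columns of
    sys.dm_hadr_database_replica_states. The log generation rate is an
    assumed input that you would measure on the primary yourself.
    """
    redo_queue_size_kb: float
    redo_rate_kb_per_sec: float
    log_send_queue_size_kb: float
    log_generation_kb_per_sec: float


def _drain_seconds(queue_kb: float, rate_kb_per_sec: float) -> float:
    """Seconds needed to work through a queue at the given rate."""
    if queue_kb <= 0:
        return 0.0
    if rate_kb_per_sec <= 0:
        # A non-empty queue that is not draining never catches up.
        return float("inf")
    return queue_kb / rate_kb_per_sec


def estimated_failover_seconds(snap: SecondaryDbSnapshot,
                               fixed_overhead_seconds: float = 0.0) -> float:
    """Redo catch-up time plus an assumed fixed detection/failover overhead."""
    return fixed_overhead_seconds + _drain_seconds(
        snap.redo_queue_size_kb, snap.redo_rate_kb_per_sec)


def estimated_data_loss_seconds(snap: SecondaryDbSnapshot) -> float:
    """Seconds of primary activity represented by the unsent log."""
    return _drain_seconds(
        snap.log_send_queue_size_kb, snap.log_generation_kb_per_sec)


def sla_alerts(snap: SecondaryDbSnapshot, rto_seconds: float,
               rpo_seconds: float,
               fixed_overhead_seconds: float = 0.0) -> List[str]:
    """Return one message per objective that the estimates exceed."""
    alerts = []
    failover = estimated_failover_seconds(snap, fixed_overhead_seconds)
    if failover > rto_seconds:
        alerts.append(
            f"estimated failover time {failover:.1f}s exceeds RTO {rto_seconds:.1f}s")
    loss = estimated_data_loss_seconds(snap)
    if loss > rpo_seconds:
        alerts.append(
            f"estimated data loss {loss:.1f}s exceeds RPO {rpo_seconds:.1f}s")
    return alerts


if __name__ == "__main__":
    # Hypothetical sample values, not measurements from a real system.
    snap = SecondaryDbSnapshot(6000, 1000, 2000, 500)
    for alert in sla_alerts(snap, rto_seconds=5, rpo_seconds=10):
        print(alert)
```

Dividing a queue size by a rate gives only a point-in-time approximation, because both values fluctuate with workload. In practice you would likely sample them repeatedly and alert on a sustained trend rather than on a single reading.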
The table below points you to topics that can help you monitor the health of your AlwaysOn Availability Groups solution.
Describes the data synchronization process for AlwaysOn Availability Groups, the flow control gates, and the metrics that are useful when monitoring an availability group. Also shows how to gather RTO and RPO metrics.
Provides information on tools for monitoring an availability group.
Provides an overview of the AlwaysOn health model.
Shows how to customize the AlwaysOn health model and customize the AlwaysOn Dashboard to show extra information.
Provides a basic overview of the AlwaysOn PowerShell cmdlets that can be used to monitor the health of an availability group.
Provides information on advanced usage of the AlwaysOn PowerShell cmdlets to monitor the health of an availability group.
Shows how to automatically monitor an availability group with an application.
Provides information on how to integrate availability group monitoring with SQL Server Agent and configure notification to the appropriate parties when problems arise.