Possible Failures During Database Mirroring
Physical, operating system, or SQL Server problems can cause a failure in a database mirroring session. Database mirroring does not regularly check the components on which Sqlservr.exe relies to verify whether they are functioning correctly or have failed. However, for some types of failures, the affected component reports an error to Sqlservr.exe. An error reported by another component is called a hard error. To detect other failures that would otherwise go unnoticed, database mirroring implements its own time-out mechanism. When a mirroring time-out occurs, database mirroring assumes that a failure has occurred and declares a soft error. However, some failures that happen at the SQL Server instance level do not cause mirroring to time-out and can go undetected.
The speed of error detection and, therefore, the reaction time of the mirroring session to a failure, depends on whether the error is hard or soft. Some hard errors, such as network failures are reported immediately. However, in some cases, component-specific time-out periods can delay the reporting of some hard errors. For soft errors, the length of the mirroring time-out period determines the speed of error detection. By default, this period is 10 seconds. This is the minimum recommended value.
Possible causes of hard errors include (but are not limited to) the following conditions:
A broken connection or wire
A bad network card
A router change
Changes in the firewall
Loss of the drive where the transaction log resides
Operating system or process failure
For example, when the log drive on the principal database becomes unresponsive and fails, the operating system informs Sqlservr.exe that a serious error has occurred.
Some components, such as network components and some IO subsystems, have their own time-outs to determine failures. Such time-outs are independent of database mirroring, which has no knowledge of them and is completely unaware of their behavior. In these cases, the time-out delay increases the time between a failure and when database mirroring receive the resulting hard error.
To help you interpret the error conditions that occur on the network, ask a network engineer what error messages are sent to a port when the following events occur on a TCP connection:
DNS is not working.
Cables are unplugged.
Microsoft Windows has a firewall that blocks a specific port.
The application that is monitoring a port fails.
A Windows-based server is renamed.
A Windows-based server is rebooted.
Conditions that might cause mirroring time-outs include (but are not limited to) the following:
Network errors such as TCP link time-outs, dropped or corrupted packets, or packets that are in an incorrect order.
A hanging operating system, server, or database state.
A Windows server timing out.
Insufficient computing resources, such as a CPU or disk overload, the transaction log filling up, or the system is running out of memory or threads. In these cases, you must increase the time-out period, reduce the workload, or change the hardware to handle the workload.
Because soft errors are not detectable directly by a server instance, a soft error could potentially cause a server instance to wait indefinitely. To prevent this, database mirroring implements its own time-out mechanism, based on each server instance in a mirroring session sending out a ping on each open connection at a fixed interval.
To keep a connection open, a server instance must receive a ping on that connection in the time-out period defined, plus the time that is required to send one more ping. Receiving a ping during the time-out period indicates that the connection is still open and that the server instances are communicating over it. On receiving a ping, a server instance resets its time-out counter on that connection.
If no ping is received on a connection during the time-out period, a server instance considers the connection to have timed out. The server instance closes the timed-out connection and handles the time-out event according to the state and operating mode of the session.
Even if the other server is actually proceeding correctly, a time-out is considered a failure. If the time-out value for a session is too short for the regular responsiveness of either partner, false failures can occur. A false failure occurs when one server instance successfully contacts another whose response time is so slow that its pings are not received before the time-out period expires.
In high-performance mode sessions, the time-out period is always 10 seconds. This is generally enough to avoid false failures. In high-safety mode sessions, the default time-out period is 10 seconds, but you can change the duration. To avoid false failures, we recommend that the mirroring time-out period always be 10 seconds or more.
To change the time-out value (high-safety mode only)
- Use the ALTER DATABASE <database> SET PARTNER TIMEOUT <integer> statement.
To view the current time-out value
- Query mirroring_connection_timeout in sys.database_mirroring.
Regardless of the type of error, a server instance that detects an error responds appropriately based on the role of the instance, the operating mode of the session, and the state of any other connection in the session. For information about what occurs on the loss of a partner, see Database Mirroring Operating Modes.
Estimate the Interruption of Service During Role Switching (Database Mirroring)
Database Mirroring Operating Modes
Role Switching During a Database Mirroring Session (SQL Server)
Database Mirroring (SQL Server)