Troubleshooting Applications

09/12/2012

AppFabric provides capabilities for managing IIS-hosted .NET Framework version 4 WCF and WF services. These capabilities include reliability mechanisms for hosting durable workflow services, simplified application configuration, and tooling to monitor the health of your .NET Framework 4 WCF and WF services. This section describes how to use the monitoring tooling to troubleshoot services.

Use the Dashboard to begin troubleshooting WCF and WF services. Click the Dashboard icon in the AppFabric group within IIS Manager to display the Dashboard page. This composite page offers a summary view of the health of the applications deployed at a particular scope in IIS and provides links to drill down for more specific troubleshooting information. If the metrics displayed by the Dashboard and its related pages do not provide the depth of information you need to resolve your problem, you can use AppFabric tooling to enable System.Diagnostics tracing to troubleshoot a WCF or WF application.

Troubleshooting by Using the Dashboard

For distributed applications, isolating a problem is not as simple as looking at one counter or using only one tool in isolation. Rather, solving problems in a distributed service environment often involves correlating data from different tools or counters to create a realistic picture of the true cause of a problem. The Dashboard contains three widgets, Persisted WF Instances, WCF Call History, and WF Instance History, that you can use to correlate information for troubleshooting your applications. The latter two of these are historical metrics, which means that what is displayed is affected by the Time Period time selector at the top of the Dashboard (Last 1 minute, Last 24 hours, and so on).

When you first open or view one of the three widgets, you see a high-level summary view of the status of its metrics. You can quickly see if there is a problem at that level. For example, the Persisted WF Instances widget initially displays the status of live durable workflow instances that have saved their state in the persistence store. It gives a summary status of the number of persisted instances that are in the Active, Idle, and Suspended states. This can quickly inform you if there is a problem at the persisted workflow level. You can click links on the summary page to get additional information about that specific summary counter.

Using Persisted WF Instances Metrics

The Persisted WF Instances widget shows the status of live workflow instances that have saved their state in the persistence store. It shows the number of persisted instances that are in the Active, Idle, and Suspended states. Each heading, and the summary count itself, is a link to the Persisted WF Instances page that contains the raw data.

To better understand the problem when troubleshooting suspended instances, click the Suspended Instances link on the Dashboard. The Persisted WF Instances page shows you all the suspended instances. Selecting one of them populates the Details pane at the bottom of the screen with summary information about the suspended instance. The Errors tab shows the reason why the instance was suspended. If you need additional information you can right-click an instance and select the View Tracked Events option. This action takes you to the Tracked Events page, showing all the tracked events for the suspended workflow instance. The events are sorted by default with the newest ones on top.

Using WCF Call History Metrics

The WCF Call History widget displays the number of WCF calls that have been received and recorded in the monitoring store. The heading displays summary counts of the calls that completed, the calls that encountered exceptions, and the number of times a throttle was hit. The first column shows the top five services with the most completed calls. The second column shows a breakdown of the WCF errors grouped by error type. The third column shows the top five services with the most service exceptions. Each metric is linked to the Tracked Events page, where you can see the raw data that is summarized on the Dashboard.

For example, to get more information about a failed call, click the Calls Failed summary counter and go to the Tracked Events page. This page is populated with the latest failed WCF calls for the selected scope in the IIS hierarchy. Selecting one of these events populates the Details pane at the bottom of the screen. The Errors tab contains exception information about the failure. If you need more context about the failed call you can right-click the event and select the View All Related Events option. This refreshes the Tracked Events page and populates it with all the events related to the first one.

Note

To use the View All Related Events option, the Monitoring level for the application must be set to End-to-End Monitoring or higher. This level tells the Monitoring infrastructure to collect transfer events that associate one end-to-end activity ID (E2EActivityId) to another.

Using WF Instance History Metrics

The WF Instance History widget shows the number of workflow instances that have been activated, failed, and completed. The first column shows the top five services with the most activations. The second column shows the top five services with the most failed instances. The third column shows how many of the failed instances were recovered. For example, if a workflow instance encountered an error that caused it to be suspended and was later resumed and completed successfully, you would see a count of one in both the Failed column and the Recovered column.

Troubleshooting by using information from the WF Instance History widget is similar to using the Persisted WF Instances widget. Click one of the headings to go to the Tracked WF Instances page where you can view summary information about the workflow instance. However, because the WF Instance History widget displays workflow instances that may have already completed, you will not see the instance control actions that you see on the Persisted WF Instances page. On the Tracked WF Instances page you can right-click an instance and view its tracked events to see low-level tracking data for the instance.

Troubleshooting by Using System.Diagnostics Tracing

AppFabric leverages the enhancements to WCF tracing and WF tracking in the .NET Framework 4 that emit events to the Event Tracing for Windows (ETW) subsystem. ETW provides a fast event-tracing infrastructure. In some scenarios, however, you will need to see all the diagnostic information available. In AppFabric you can enable System.Diagnostics tracing at the application or site level. This tracing information is written to files on the disk and can be viewed with the Service Trace Viewer tool.

Warning

System.Diagnostics tracing reduces the performance of your applications and generates large trace files. You should enable System.Diagnostics tracing only when you are troubleshooting your application.

Troubleshooting a Distributed ASP.NET Application

You can use the activity ID (as mentioned in the previous Using WCF Call History Metrics section) to look up a durable instance ID and help troubleshoot an ASP.NET application that communicates across multiple AppFabric server machines to WCF services. To accomplish this, the monitoring level must be set to at least End-to-End Monitoring so that activity tracing is configured. The following example shows how this might work.

While monitoring the Web application using ASP and IIS monitoring tools, the administrator notices an increase in timeout errors for the ASP.NET application while it communicates with a service. After checking the errors, the administrator obtains the activity ID that was active at the time the error occurred. Using the AppFabric Dashboard, the administrator creates and executes a query for that same activity ID. One event is returned: a receive event on instance Y of service X. The administrator views the message flow associated with that event and notices a “Throttling Maximum Concurrent Calls exceeded” event. Looking at the service summary pages, the administrator notices that the “WCF Throttle Hits” is set to 20. The administrator also views the calls over the last 24 hours and sees that they have spiked over the last 30 minutes. Viewing other days shows the same pattern at the same time each day. The administrator concludes that the load increases at those times each day and that the maximum concurrent calls throttling setting should be increased to 25. After the change, careful monitoring shows that the incidence of throttling events decreases dramatically, as do the timeout errors for the ASP.NET application when it calls the WCF services.

SQL Server Agent Windows Service Jobs

The SQL Server Agent Windows service monitors SQL Server operations, automates certain administrative duties, generates alerts as needed, and schedules and executes jobs. A SQL Server job is a specified series of operations performed sequentially by the SQL Server Agent covering a variety of database-related tasks. When AppFabric is configured to use SQL Server, it uses the SQL Server Agent to import events into the Monitoring data store and to regularly purge old data from the store.

If the SQL Server Agent is not running, the scheduled AppFabric jobs cannot execute within your SQL Server installation. Here are some steps to ensure that these jobs execute as scheduled using the SQL Server Agent Windows service:

1. Ensure that the SQL Server Agent Windows service is running by checking its status. Click Administrative Tools, select Services, select SQL Server Agent (MSSQLSVR), and ensure that its Status currently shows Started. If not, start the service.
2. Four SQL Server jobs are created when the store is initialized:
   - Microsoft_ApplicationServer_Monitoring_AutoPurge_<monitoring DB name>
   - Microsoft_ApplicationServer_Monitoring_ImportWfEvents_<monitoring DB name>
   - Microsoft_ApplicationServer_Monitoring_ImportWcfEvents_<monitoring DB name>
   - Microsoft_ApplicationServer_Monitoring_ImportTransferEvents_<monitoring DB name>
3. If the SQL Server Agent Windows service is started and the AppFabric jobs are not running, check the errors encountered when the job was last run.
4. If the job instances finished successfully but there are no events in the event tables, look at the ASFailedStagingTable table in the Monitoring data store. This table contains columns, such as ErrorNumber and ErrorMessage, that can help you determine the reason for the failure. If no error occurred, this table is empty.
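
As a rough sketch of the last two checks, the queries below read the SQL Server Agent job history from the msdb system tables and then list any rows in ASFailedStagingTable. The job-name pattern and the TOP (20) limit are illustrative choices, not AppFabric requirements. The second query assumes it is run in the context of the monitoring database, and it uses only the ErrorNumber and ErrorMessage columns named above.

```sql
-- Most recent SQL Server Agent history rows for the AppFabric monitoring jobs.
-- run_status: 0 = failed, 1 = succeeded, 2 = retry, 3 = canceled.
SELECT TOP (20)
       j.name       AS job_name,
       h.step_id,
       h.run_status,
       h.run_date,  -- integer in YYYYMMDD form
       h.run_time,  -- integer in HHMMSS form
       h.message
FROM msdb.dbo.sysjobs AS j
JOIN msdb.dbo.sysjobhistory AS h
     ON h.job_id = j.job_id
WHERE j.name LIKE N'Microsoft[_]ApplicationServer[_]Monitoring[_]%'
ORDER BY h.instance_id DESC;

-- Run in the monitoring database: any rows here describe import failures.
SELECT ErrorNumber, ErrorMessage
FROM ASFailedStagingTable;
```

A run_status of 0 on a recent row points to a job failure whose details are in the message column. An empty result from the second query suggests that no import errors were recorded.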

The Windows identity (owner) under which an AppFabric SQL Server Agent job runs should not be that of the logged-in user. Rather, it should run under the identity of the preconfigured AS_MonitoringDbJobsAdmin Windows security account. This account should ideally be made a domain account. This account is given the appropriate permissions in the monitoring store when the Initialize-ASMonitoringDatabase cmdlet is run following installation.

Here is how the SQL Server Agent processes the different owners of a submitted AppFabric Server job:

- If the owner of a SQL Server Agent job is a domain account, SQL Server contacts the domain controller to check if the account is valid. If so, it runs under the domain account. If not, it tries to run under a local account.
- If the owner of a SQL Server Agent job is a local sysadmin account, the SQL Server Agent will correctly honor the “RunAs” identity on the job step and issue the “Execute user as AS_MonitoringDbJobsAdmin” command. This means the job runs under the identity of the AS_MonitoringDbJobsAdmin account.

Note

To view the “RunAs” identity for a job within SQL Server Management Studio, right-click the job, click Properties, and then click the Advanced tab. This value should be set to the AS_MonitoringDbJobsAdmin account.

- If the owner of a SQL Server Agent job is a local account but is not sysadmin, the “RunAs” identity provided on the job step is ignored. In this case the SQL Server Agent service runs the job as the local owner.
- In the disconnected domain scenario (a laptop, for example), if the identity of a SQL Server monitoring job is a domain user, the SQL Server Agent tries to validate the account by contacting the domain controller. If the computer running the SQL Server Agent job is disconnected from the domain, this will fail. To mitigate this failure, there are two options:
  1. The first, and simplest, option is to configure the job (that is, run the Initialize-ASMonitoringDatabase cmdlet) using a local user account.
  2. The second option applies if the job was configured by an owner using a domain account: update the job owner afterwards to a local user account by using the SQL Server sp_update_job stored procedure.
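
A minimal sketch of the second option follows, assuming the job names listed earlier in this section. The first query shows the current owner of each monitoring job. The second reassigns one job to a local login. The database name and the login in angle brackets are placeholders that you must replace with values from your own installation.

```sql
-- List the current owner of each AppFabric monitoring job.
SELECT j.name                    AS job_name,
       SUSER_SNAME(j.owner_sid)  AS owner_login
FROM msdb.dbo.sysjobs AS j
WHERE j.name LIKE N'Microsoft[_]ApplicationServer[_]Monitoring[_]%';

-- Reassign one job to a local login (both values are placeholders).
EXEC msdb.dbo.sp_update_job
     @job_name         = N'Microsoft_ApplicationServer_Monitoring_AutoPurge_<monitoring DB name>',
     @owner_login_name = N'<computer name>\<local user>';
```

Repeat the sp_update_job call for each of the four jobs if all of them were created under a domain account.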

A best practice is to configure the SQL Server Agent Windows service to run under a local logon account. This allows the service to run on an AppFabric computer even if its network connection has been terminated. If this service is running under a domain account and the domain cannot be accessed, the SQL Server Agent Windows service cannot obtain credentials. This means that the Monitoring events cannot be moved correctly into their final destination tables. The result is that no new data will be displayed in the Dashboard.

SQL Server Express Jobs

Although the steps for diagnosing why a job failed are very similar to those for SQL Server, you will need to look at the ASJobsTable table for SQL Server Express. This table is specific to a SQL Server Express installation and is not present in a SQL Server installation. Within this table you can view the values of the LastRunOn and LastRunSuccess columns in a specific job row to determine whether a job has run successfully or failed.

SQL Server Express does not use the SQL Server Agent Windows service. Rather, it leverages the SQL Server Service Broker. The Service Broker feature has a timer capability whose time-out value is set to the interval at which the Microsoft_ApplicationServer_Monitoring_AutoPurge_<monitoring DB name> job is configured to run. When this time-out interval elapses, a message is sent to the SQL Server job queue. This message activates the stored procedure that is run as part of the Microsoft_ApplicationServer_Monitoring_AutoPurge_<monitoring DB name> job. In turn, this runs the automatic purge functionality against a SQL Server Express store.

Here are some T-SQL queries you can run to help monitor the progress of the automatic store purge functionality:

- Show the currently scheduled jobs: SELECT * FROM ASJobsTable
- Look at the dialog_timer column (in UTC time) to see when the job is scheduled to run next: SELECT * FROM sys.conversation_endpoints
- Show the currently executing activation procedures: SELECT * FROM sys.dm_broker_activated_tasks
- Discover how many messages are still in the queue. When no jobs are running, this returns no rows: SELECT * FROM ASScheduledJobQueue
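
If you want narrower output than SELECT * provides, the same checks can be sketched as follows. The LastRunOn and LastRunSuccess columns come from the description of ASJobsTable above. The other columns are those of the standard Service Broker catalog and dynamic management views. The NOLOCK hint and the COUNT(*) aggregate are illustrative choices. The sketch assumes that the queries run in the monitoring database.

```sql
-- When each scheduled job last ran and whether it succeeded.
SELECT LastRunOn, LastRunSuccess
FROM ASJobsTable;

-- Next timer expiry (UTC) for each Service Broker conversation.
SELECT conversation_handle, state_desc, dialog_timer
FROM sys.conversation_endpoints;

-- Activation procedures that are executing right now.
SELECT spid, database_id, queue_id, procedure_name
FROM sys.dm_broker_activated_tasks;

-- Number of messages still waiting in the job queue.
SELECT COUNT(*) AS pending_messages
FROM ASScheduledJobQueue WITH (NOLOCK);
```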