Debugging Production Problems in Your XML Web Service
December 19, 2001
Debugging can be a particularly challenging part of any development work, even when you are sitting at your desk with all your normal development tools and resources. It can become a lot more challenging if you have to move your debugging environment to a cold and noisy computer lab with none of your normal tools, where you work under strict constraints on how much downtime you are allowed and with the nerve-wracking knowledge that the problems can often cost your company sums of money that make your annual salary seem about as significant as the price of two Chiclets®.
What Is It About Production Problems?
There are a number of exceptionally challenging issues with regard to fixing production problems. First and foremost is that by definition, you are talking about a problem that affects the ability of your system to do what it was designed to do. Simply changing a line of code, recompiling, and trying again is not an acceptable means of doing development work in this environment. The system has to stay up, and downtime must be minimized. Further, a problem once contained to a single machine in your main development environment is now significantly more complex; you may not know on which machine in the Web farm the problem resides. To top it all off, the machine you are working on may be exposed to the Internet, so common debugging techniques may be ruled out for security reasons.
So what can you do?
I would be remiss if I didn't at least mention that debugging production problems would be a non-issue if you did not have production problems. My mother always told me that an ounce of prevention is worth a pound of cure, and good coding practices, proper testing, and careful deployment should result in production software that does not fail. (Actually, my mother only told me about the prevention and cure part—not about the good practices part.)
Now realistically, we have all had problems in production, including those of us here at Microsoft®. However, it is not impossible to ship problem-free production software. In fact, it is getting easier and easier to write bug-free code. The number of lines of code required to write a functional transactional component in Microsoft® Visual Studio® .NET is minimal compared to what it would take to do the same thing 5 years ago. There really are a lot fewer places where a developer can make an honest mistake that could cause things to go astray.
Nevertheless, problems can still happen, and it is extremely important that you plan for problems in your production environment, and prepare yourself with the tools required, so that you are ready for problems if they happen. If you are fortunate enough to roll out a problem-free application, you may take comfort in knowing that the extra time spent preparing your systems and yourself was well spent polishing those preparation skills for the next rollout that may have problems.
Tools of the Trade
On your standard development machine, you probably use a debugger like Visual Studio .NET to fix most of the problems you run into. It is quite convenient to set break points and step into code, view the values of variables, and generally see where logic may be going astray. For more complex problems, you may also need to employ profiling tools, or tools used for stress testing your environment. When working with your XML Web Service, you might utilize one of the SOAP-tracing utilities provided by various toolkits. Yet for production problems, many of these tools will not be viable options. You may need different tools altogether, or you may need to use existing tools in different ways. Let's take a look at some of the tools that could prove useful in a production environment.
A legitimate production environment for your XML Web Service will include monitoring tools that will let you know if your XML Web Service is down on any of the machines in your Web farm. But monitoring also allows you to go beyond the Boolean question of whether your service is up or not. You can use monitoring to determine if any of your n-tier components are experiencing difficulties, or if their performance is not up to par. Let's take a look at how you could do this.
Custom Monitoring Tools
I'm going to start off my monitoring discussion by looking at the hardest to implement but probably the most important monitoring tool available, a custom monitor. I'm talking about writing a program that will use your XML Web Service just as your normal clients will use your service. This is not a stress-testing tool, nor even a normal client load-testing tool. This should be a tool that sends requests periodically to your server in something as close to a normal user scenario as you can implement.
Be aware that success and failure will not be the clear, binary result you might initially wish for. You will also need to carefully monitor the time intervals for your requests to complete. Set an acceptable delay threshold with your custom monitoring tool, but also log request/response intervals, so that historical data is available. For instance, you may see average response times slowing over the course of a fairly long period of time, due to higher usage. If you simply wait for results to cross your threshold for failure, you will find yourself in an emergency situation, trying to resolve the problem. However, if you are monitoring your site regularly and logging results, you can analyze the data and determine trends.
In the case of increased usage, you could then see the trend and address the problem by adding another machine to your Web farm, or by making improvements to the efficiency of your XML Web Service code. Other sorts of monitoring can give you more details on where problems are occurring, but there is not a better mechanism for determining the overall health of your service than by sending requests just like your clients are sending.
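A custom monitor along these lines can be quite small. The Visual Basic .NET sketch below probes an endpoint once a minute, logs each response time for later trend analysis, and flags requests that exceed a threshold. The URL, threshold, probe interval, and log file name are all hypothetical placeholders for your own values.

```vb
Imports System
Imports System.IO
Imports System.Net
Imports System.Threading

Module ServiceMonitor
    Sub Main()
        ' Hypothetical endpoint and threshold -- substitute your own.
        Dim url As String = "http://myserver/FavoritesService/Service.asmx"
        Dim thresholdMs As Double = 2000

        Do
            Dim start As DateTime = DateTime.Now
            Dim status As String
            Try
                Dim request As WebRequest = WebRequest.Create(url)
                request.GetResponse().Close()
                status = "OK"
            Catch ex As WebException
                status = "FAILED: " & ex.Message
            End Try
            Dim elapsedMs As Double = _
                DateTime.Now.Subtract(start).TotalMilliseconds

            ' Log every probe, not just failures, so that historical
            ' trends can be analyzed later.
            Dim writer As StreamWriter = File.AppendText("monitor.log")
            writer.WriteLine(DateTime.Now & vbTab & status & vbTab & _
                elapsedMs & "ms")
            writer.Close()

            If elapsedMs > thresholdMs Then
                ' Raise an alert here (e-mail, event log, and so on).
            End If

            Thread.Sleep(60000)   ' probe once a minute
        Loop
    End Sub
End Module
```

A real monitor would send an actual SOAP request through your generated proxy class rather than a bare HTTP request, so that the full request path is exercised, but the timing and logging pattern is the same.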
Performance Monitor
The tried-and-true monitoring tool on Microsoft® Windows® systems has always been Performance Monitor. The beauty of Performance Monitor is that it not only gives you the standard system counters for monitoring the general health of your systems, applications, and processes, but you can also create your own custom performance counters for your application, and monitor them through Performance Monitor's impressive infrastructure. Add to that its capability to log counter data, send alerts of failures through various means, and monitor multiple machines simultaneously, and you can begin to appreciate its power as a tool for unobtrusively finding problems in your production environment.
If you want to know how many requests your XML Web Service is currently handling, how long it is taking for requests to complete, or how internal resources are being used, then you can create your own performance counters to be monitored by Performance Monitor. The System.Diagnostics.PerformanceCounter class in the .NET Framework makes it easier than ever to implement your own performance counters. These can prove exceptionally useful when trying to track down a specific problem or bottleneck.
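As a sketch of what that looks like, the following Visual Basic .NET class creates a counter category if it does not yet exist and increments a requests counter. The category name, counter name, and class name are made up for illustration.

```vb
Imports System.Diagnostics

Public Class ServiceCounters
    Private Shared requestsCounter As PerformanceCounter

    Shared Sub New()
        ' Create the (hypothetical) category the first time the
        ' service runs on this machine.
        If Not PerformanceCounterCategory.Exists("Favorites Service") Then
            PerformanceCounterCategory.Create("Favorites Service", _
                "Counters for the Favorites XML Web Service", _
                "Requests Completed", "SOAP requests completed")
        End If
        ' False means the counter is writable, not read-only.
        requestsCounter = New PerformanceCounter( _
            "Favorites Service", "Requests Completed", False)
    End Sub

    ' Call this from your WebMethod each time a request completes.
    Public Shared Sub RequestCompleted()
        requestsCounter.Increment()
    End Sub
End Class
```

Once the category exists, the counter shows up in Performance Monitor's Add Counters dialog alongside the system counters, and can be logged and alerted on like any other.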
But implementing your own counters is hardly required for Performance Monitor to be considered a useful tool in tracking down problems. The system counters provided to determine CPU utilization and monitor your process's memory footprint are in themselves quite useful to see if your XML Web Service is using all of your CPU resources, or if it is leaking memory.
There are also ASP.NET counters you can use to determine how many requests are being processed and how long they are taking. These can often address the need for counters specific to your own application, and thus save you from writing code to handle your own performance counters. In the case of XML Web Services, you can monitor the Microsoft® Internet Information Server "Post Requests/sec" counter to track requests to your virtual server. Depending on what other applications you might be running on your server, this could be a good way to figure out how many SOAP requests your XML Web Service is receiving.
XML Web Service Auditing
You should be auditing the events that occur with your XML Web Service, just like you do for any other production application. Information on completion codes, time elapsed while servicing the request, and critical parameter values can make your auditing records a key piece of information that is useful when debugging production problems. Statistical analysis of your audit logs should be considered for identifying trends and validating conclusions reached through other means. Audit logs can also prove useful if you need to determine the historical state of your data while analyzing a problem.
Internet Information Server Logging
HTTP requests received by your machine are still funneled through Internet Information Server before being handled by your particular ASP.NET code. You can therefore take advantage of Internet Information Server's logging capabilities to monitor pieces of information about the requests being received. This will allow you to track how many requests you are handling, as well as specific error codes and request completion times. As opposed to Performance Monitor, the log lets you see details about specific requests, rather than a statistical view of the larger data space. To get the most information, you should use the "Microsoft IIS Log File Format" option when you configure your virtual server for logging requests.
Event Viewer
Event Viewer provides a means for reading system events, but like Performance Monitor, you can customize the data in Event Viewer for your own application. In the Favorites Service, our sample Web service, we created events whenever we experienced any server errors. These were events specific to the Favorites Service and had information that we could use to figure out what sorts of errors we were experiencing. Use the System.Diagnostics.EventLog class to log events using the .NET Framework.
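A minimal version of that pattern might look like the following sketch; the event source name is hypothetical, and the details you include should be whatever will help you diagnose the failure later.

```vb
Imports System.Diagnostics

Public Class ErrorLogger
    ' Write a server error to the Application event log under a
    ' (hypothetical) source name for this service.
    Public Shared Sub LogServerError(ByVal message As String)
        If Not EventLog.SourceExists("FavoritesService") Then
            EventLog.CreateEventSource("FavoritesService", "Application")
        End If
        EventLog.WriteEntry("FavoritesService", message, _
            EventLogEntryType.Error)
    End Sub
End Class
```

Note that creating an event source requires administrative rights, so in practice you would register the source at setup time rather than on first use.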
Other Monitoring Tools
There are numerous other monitoring tools available, including tools from Microsoft, such as Microsoft® Application Center, and tools from third parties, such as NetIQ's WebTrends. Some simply take advantage of the information in the IIS logs, or from performance monitor counters, and make them easier to track and get alerts from in Web farm scenarios. Others plug into the underlying infrastructure, so that they can give you additional information.
Monitoring is particularly useful when trying to determine whether there is a problem, and sometimes it can give you enough information to indicate the source of some types of problems. Nevertheless, quite often you need to perform real debugging in order to fix a problem. We will take a look at some of your options for doing just that with the tools listed below.
Visual Studio .NET
I cannot talk about debugging without talking about the premier debugger of all, Visual Studio .NET. Visual Studio .NET is very useful, and can be used in production environments if need be. Realize, however, that if you hit a break point in your code, not only is your request stopped, but any other requests hitting your machine are stopped as well. Sometimes problems can be avoided, however, by pulling a machine out of the round robin system for your Web farm. This is a decent way to avoid further requests hitting your machine, but if your XML Web Service is handling several simultaneous requests, then all current requests being handled by your XML Web Service will be delayed and most likely lost when the client applications that sent the requests time out.
Visual Studio .NET does have some features that make it better suited to use in a production environment. The ability to debug remotely is great; however, you have to make sure you set up your permissions properly, so that you don't allow unauthorized access. Visual Studio .NET also now has the ability to disconnect from a running process without causing the process to end. For those of you used to previous versions of Visual Studio, or other common Windows debuggers, detaching the debugger from a process has historically meant that the process was killed. Obviously, this has ramifications for a production environment, since you want neither to keep your debugger attached to your process until the process ends, nor to kill your process and cause it to be restarted.
The disadvantage of using Visual Studio .NET, or any other standard debugger, is that often a production environment cannot be down for the length of time it might take a typical problem to be debugged. In certain cases this may be necessary, but quite often you can get away with simply taking a snapshot of the state of the current process.
Snapshot Debuggers: Autodump+, COM+ Dumps, and Dr. Watson Dumps
One of the options for debugging a running process is to take a snapshot of the current state of the process. This will not allow you to step through the code like you can in a standard debugger, but it will allow you to see all the threads in a particular process and what they are trying to do. This can be useful in many cases, such as when one thread is in an infinite loop, or all your threads in your thread pool are blocked waiting on some global resource, like a critical section.
The steps for taking a snapshot of your XML Web Service's process are fairly straightforward. There are basically two ways that you can invoke a debug snapshot. One is to manually trigger a dump using the Autodump+ tool. This is useful in situations where there are blocked threads or resource contention problems, and specific faults are not being thrown. The other option is to configure Autodump+ to create a memory dump when a fault occurs. This is useful if your code is experiencing exceptions of some sort that are causing its problems.
Sometimes you will want to create dumps when a first-chance exception occurs that would not normally bring down the process. Autodump+ gives you a lot of flexibility in regard to when faults of various sorts cause a snapshot to be made. On Windows XP, COM+ can be configured to create a dump file on faults as well, but it does not give you the flexibility to create dumps on first-chance exceptions or on faults of certain types. Dr. Watson can also be configured to create dump files when processes crash, but like the COM+ capabilities, it does not give you the flexibility to get a snapshot when a less severe error is encountered. See Microsoft Knowledge Base article Q286350: HOWTO: Use Autodump+ to Troubleshoot "Hangs" and "Crashes" for more information on the Autodump+ utility.
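Assuming the Autodump+ script (adplus.vbs) from the Microsoft Debugging Tools for Windows is installed, the two modes described above correspond to command lines along these lines; the process ID shown is a placeholder, and Q286350 is the authoritative reference for the syntax and options.

```
cscript adplus.vbs -hang -p 1234     (manual snapshot of a hung or blocked process)
cscript adplus.vbs -crash -p 1234    (attach and create a dump when a fault occurs)
```

In -hang mode the dump is taken immediately and the process continues; in -crash mode the script stays attached and waits for a fault.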
You can read the dump files created by Autodump+, COM+, or Dr. Watson with Visual Studio .NET. You will be working in the native-mode debugger for the most part. It is important that you have debug symbol files for the components involved. It also helps when reading these dumps to have the symbol files for the version of the operating system involved. But even with symbol files, analyzing dumps can be pretty tricky. One thing you will definitely want to do is become familiar with what a typical snapshot of your XML Web Service's process looks like when it is working properly. Then you can compare that with the results you get when you are experiencing a problem. Look out for things such as message box functions in the call stacks for threads. This means that a thread is displaying a message box which, because it is not connected to the interactive desktop, will never be displayed where someone can click on it. The thread displaying the message box is basically hung forever at that point.
You should also be aware that when you use any of these tools to take a snapshot of your production process, the resulting dump file can be quite large. Production hard drives can typically handle the size of the file without too much problem, but it may take anywhere from many seconds to several minutes to write the dump. Although this might be better than being down for a couple of hours while you debug with a normal debugger, it is far from zero downtime.
ASP.NET Tracing
One of the cooler debugging options of ASP.NET is the ability to perform tracing at a pretty low level. This is great for debugging Web Forms applications, but it can be used to debug your XML Web Services as well. The nice thing about the way ASP.NET tracing is implemented is that production systems can take advantage of it without exposing themselves to security risks or making the environment sluggish. For a general overview of tracing, see MSDN's Nothing But ASP.NET article, Tracing.
The key to using ASP.NET tracing in your production environment is to write tracing code that gives you meaningful information, and then to configure your system when you are interested in seeing traces. To add tracing to your production code, determine whether tracing is enabled, and then write concise information using the TraceContext class to indicate critical pieces of information, such as when certain tasks have started or completed, or when certain error conditions occur. Do not forget to include information such as the values of critical variables. The following code shows how you can detect whether tracing is enabled, and then write to the tracing log. Remember that this code can be used wherever the current HttpContext object is available. In the case of your XML Web Service, that should be just about everywhere within the handling of your request, including calls into other classes.
If HttpContext.Current.Trace.IsEnabled Then
    HttpContext.Current.Trace.Write("MyApplication", _
        "Entering the GetData function")
End If
Once you have added tracing to your XML Web Service code, you need to configure your system to perform tracing. This is where ASP.NET really shines in my opinion—by providing mechanisms for performing tracing in a secure fashion, and in such a way that a busy XML Web Service will not be significantly impacted. It will also allow you to receive only the data you really need.
Tracing configuration options are controlled inside the Web.config file in your ASP.NET application. Below is the Trace element of a Web.config file configured for performing production tracing.
<trace enabled="true" requestLimit="10" pageOutput="false" traceMode="SortByTime" localOnly="true" />
The pageOutput attribute is set to false. If it were set to true, it would allow you to see the extensive tracing information ASP.NET provides at the bottom of a typical HTML page. The problem with this approach for XML Web Service tracing is that you never get to see any HTML, since the service is a programmatic interface. Therefore, we turn this off.
Since the tracing data is not appended to an HTML page when pageOutput is turned off, you must view your tracing information by requesting the trace.axd file in your virtual directory from a browser. ASP.NET will provide you all the HTTP headers and server variables, as well as all your particular trace messages. One of the key questions, of course, is: How can this be considered a secure mechanism for debugging your XML Web Service, when anyone could potentially jump into their browser and see your tracing information? Setting the localOnly attribute to true, as shown in the example, only allows a browser on the local machine to see the tracing output. This might be a headache in a Web farm scenario, because you may have to manually track down the machine where a request was handled, but it does prevent anyone else from seeing the tracing information.
The last key piece of the tracing configuration is the requestLimit option. This allows you to avoid tracing every single request that comes into your system. You only need to trace when you are having a problem, and because your code checks whether tracing is enabled before writing to the tracing logs, it will not bother with tracing in the vast majority of cases. When you suspect a problem, you can turn on tracing by specifying that only a certain number of requests will be traced, after which tracing is disabled again. If your XML Web Service is servicing hundreds of requests, this is a convenient way to avoid being flooded with too much information. Basically, ASP.NET traces only the specified number of requests, and then processes subsequent requests without tracing.
If you want to be more intricate with your tracing, you can establish different levels of trace messages. For instance, you may have messages that would be considered errors, others that would be considered warnings, others that may just be considered informational, and others that you may only want to display when you are performing a verbose level of tracing. The System.Diagnostics.TraceSwitch class provides an easy mechanism for checking what level of tracing is configured for your application, so you can log trace messages accordingly.
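A sketch of that pattern follows; the switch name "FavoritesSwitch" is hypothetical, and its level (Off, Error, Warning, Info, or Verbose) would be set in the application configuration file.

```vb
Imports System.Diagnostics
Imports System.Web

Public Class LeveledTracing
    ' The switch name and description are hypothetical; the level is
    ' read from the configuration file when the switch is created.
    Private Shared levelSwitch As New TraceSwitch( _
        "FavoritesSwitch", "Trace level for the Favorites Service")

    Public Shared Sub TraceWarning(ByVal message As String)
        ' Emit the message only when the configured level is Warning
        ' or more verbose, and only when ASP.NET tracing is enabled
        ' for the current request.
        If levelSwitch.TraceWarning AndAlso _
           HttpContext.Current.Trace.IsEnabled Then
            HttpContext.Current.Trace.Warn("MyApplication", message)
        End If
    End Sub
End Class
```

Checking the switch before writing keeps the cost of disabled trace levels down to a single property test, which matters on a busy service.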
Debugging in a production environment can be a challenging experience, but if you are prepared, it can make highly stressful situations significantly easier to deal with. By taking advantage of the infrastructure provided by Windows and the .NET Framework, you can use monitoring tools, traditional debuggers, snapshot debuggers, and ASP.NET tracing to narrow down your problems quickly and with a number of advantages that meet the requirements of production systems.
In the next At Your Service column, Scott will be back to introduce Phase II of MSDN's Favorites Service, a sample XML Web Service that meets real-world requirements for licensing, security, auditing, and reliability. You can see what we have in store for building on the success of Phase I of MSDN's first XML Web Service.