December 2010

Volume 25 Number 12

.NET Performance - Performance Diagnostics of .NET Applications Using ETW

By Subramanian Ramaswamy | December 2010

You write a managed application and take it for a spin—and it’s slow. Your application is functionally correct, but its performance is much to be desired. You’d like to diagnose the performance issues and resolve them, but your application is running in a production environment, so you can’t install profilers or disrupt it. Or your application may not be used widely enough to justify buying Visual Studio Profiler for CPU profiling.

Happily, Event Tracing for Windows (ETW) can mitigate these issues. This powerful logging technology is built into many parts of the Windows infrastructure and is leveraged in the Microsoft .NET Framework 4 CLR to make it simpler than ever to profile your managed application. ETW collects system-wide data and profiles all resources (CPU, disk, network and memory), making it extremely useful for obtaining a holistic view. Moreover, the ETW ecosystem can be tuned so it’s low overhead, making it suitable for production diagnostics.

The goal of this article is to give you an idea of the power of using ETW to profile your managed application. I won’t cover everything—there are several OS events and CLR ETW events available for diagnostics that we won’t look at. But you will get insights into how the performance and functionality of your managed app can be improved dramatically using the ETW ecosystem. To get you started with ETW-based diagnostics for your managed code, I’ll showcase a sample investigation using the freely available ETW tool, PerfMonitor, downloadable from bcl.codeplex.com/releases/view/49601.

PerfMonitor

PerfMonitor lets you quickly and easily collect ETW performance data and generate useful reports. It’s not intended to be a replacement for deep analysis tools, such as the Visual Studio Profiler; rather, it provides you with an overview of an application’s performance characteristics and lets you perform some quick analyses.

There’s another tool for ETW diagnostics called XPerf, which is freely available through the Windows Performance Toolkit. However, while XPerf is great for native code profiling on Windows, it does not yet have deep support for managed code profiling. PerfMonitor, on the other hand, exposes the scope and power of profiling managed code using ETW. PerfMonitor has the ability to gather symbolic information associated with .NET Framework runtime code, making it valuable for .NET Framework performance investigations, although it doesn’t support the in-depth analysis that XPerf can provide. 

PerfMonitor is a fully self-contained tool, and is all you need to start profiling and diagnosing your managed application. The only other requirement is that you must be running at least Windows Vista or Windows Server 2008. PerfMonitor is a command-line tool, and typing PerfMonitor.exe usersGuide from its location will bring up an overview. If you have a customer whose program you want to diagnose under operating conditions—for instance, on a production server—all you need to do is copy the file over to that machine and you’re ready to start collecting profiles. The profiles can be analyzed offline if needed.

Four factors are generally examined during any performance investigation: CPU, disk I/O, memory and scalability. Most investigations start with the CPU, which affects both the startup and execution time of your application. Examining disk I/O is most helpful when diagnosing lengthy startup times (disk I/O is a major factor in cold startup times, which is the time it takes for an application to start when it’s not present in memory, such as after a reboot), whereas excessive memory consumption (or leaks) can cause your app to grow slower over time. Scalability matters if you want your application to achieve throughput proportional to the number of processors.

PerfMonitor helps you get a quick snapshot of all these except scalability, and it also provides you with enough information to dig deeper using other specialized tools. For example, for diagnosing issues with the CLR .NET garbage collection (GC) heap, the CLRProfiler is a better bet. However, PerfMonitor quickly informs you whether there’s an issue and you need to dig deeper using other tools. In some cases, PerfMonitor itself points out the problem and contains all the information you need to tackle a performance bug, as you’ll soon see. Take a look at the CLR Inside Out column, “Memory Usage Auditing for .NET Applications” (msdn.microsoft.com/magazine/dd882521), which discusses the importance of auditing your program for memory usage and planning for performance. Extending that philosophy, PerfMonitor quickly lets you audit many performance aspects of your managed program, not just memory.

A Sample Investigation: CsvToXml

The sample program I’ll diagnose using ETW converts a CSV file into an XML file. The source code, along with the solution package (and a sample input CSV file, data.csv), was formerly available at code.msdn.microsoft.com/mag201012ETW. To execute the program, run the command CsvToXml.exe data.csv output.xml.

Like many programs, CsvToXml was quickly stitched together and the developer never anticipated that it would be used for large CSV files. When I began using it in the real world, I found it was too slow. It took more than 15 seconds to process a 750K file! I knew there was a problem, but without a profiling tool, I’d really just be guessing as to the cause of the slowness. (Can you spot it just by looking at the source?) Luckily, PerfMonitor can help you figure it out.

Generating and Viewing the Program Trace

The first step is to do a quick audit of the application by executing the following command in an administrator command-prompt window (ETW collects data machine-wide, and hence needs admin privileges):

PerfMonitor runAnalyze CsvToXml.exe data.csv out.xml

This will start ETW logging, launch CsvToXml.exe, wait for 
CsvToXml to complete, stop logging and, finally, bring up a Web page showing the analysis for CsvToXml. In one easy step, you have a wealth of data to help you uncover the performance bottlenecks in CsvToXml.

The result of this command is captured in Figure 1. The page contains, among other data, the process ID, the command line used, and a breakdown of high-level performance data including CPU statistics, GC statistics and just-in-time (JIT) statistics. PerfMonitor also provides a first-level analysis on where to begin the diagnostics, with useful links to informational articles or other tools.

Figure 1 Performance Analysis for CsvToXml

Figure 1 Performance Analysis for CsvToXml

The report shows that the format conversion took nearly 14 seconds, of which 13.6 seconds were in the CPU with an average utilization of 99 percent. Thus, the scenario was CPU-bound.

The total time in GC and the GC Pause times are low, which is good; however, the max GC allocation rate is 105.1MB/sec, which is excessive—this merits further investigation.

CPU Analysis

The Detailed CPU Analysis provides a breakdown of CPU time, as shown in Figure 2, and there are three ways of reading CPU profile data. The bottom-up view quickly tells you which methods are consuming the most CPU time and should be diagnosed first. The top-down view is useful for finding out if your code needs architectural or structural changes and helps you understand the overall performance of your program. The caller-callee view indicates the relationship among methods—for example, which methods call which. 

Figure 2 Bottom-Up Analysis of CsvToXml.exe

Figure 2 Bottom-Up Analysis of CsvToXml.exe

Like other CPU profilers, PerfMonitor views give you the inclusive time (the time spent in a particular method, including time spent in its callees) and the exclusive time (the time spent in a particular method excluding callees). When inclusive and exclusive times are equal, the work is done within that particular method. PerfMonitor also provides a CPU Utilization graph that breaks down CPU usage over time for a particular method. Hovering over the column headings in the report gives you more details on what they mean.

Most performance investigations start with the bottom-up view, which is a listing of methods by exclusive time (this view is shown in Figure 2). Upon selecting the bottom-up view, you can see that the mscorlib method System.IO.File.OpenText is the one using the most CPU. Clicking that link brings up the caller-callee view for the OpenText method, which reveals that the CsvToXml.CsvFile.get_ColumnNames method is invoking OpenText from the program—and get_ColumnNames is consuming almost 10 seconds of CPU time (Figure 3). Furthermore, this method is called from CsvToXml.CsvFile.XmlElementForRow within a loop (XmlElementForRow itself is called from the Main method).

Figure 3 Caller-Callee View for get_ColumnNames

Figure 3 Caller-Callee View for get_ColumnNames

Thus, something seems to be amiss in these methods. Pulling up the code of these methods leads you to the problem, highlighted in Figure 4: the file is opened and parsed repeatedly inside a loop!

Figure 4 Method ColumnNames Is Invoked by Method XmlElementForRow

public string[] ColumnNames
{
  get
  {
    using (var reader = File.OpenText(Filename))
      return Parse(reader.ReadLine());
  }
}
public string XmlElementForRow(string elementName, string[] row)
{
  string ret = "<" + elementName;
  for (int i = 0; i < row.Length; i++)
    ret += " " + ToValidXmlName(ColumnNames[i]) + "=\"" + EscapeXml(row[i]) + "\"";
  ret += "/>";
  return ret;
}

Similar scenarios happen more frequently than you’d think. When the method was originally written, the developer may have believed it was going to be invoked only rarely (as was the case with ColumnNames), and thus may not have paid too much attention to its performance. However, situations often come along later that end up calling the method in a loop, and the performance of the application suffers.

In a CSV file, all rows have the same format, so there’s no point in doing this every time. You can hoist the ColumnNames functionality into the constructor, as in Figure 5, leaving the property to provide the cached column names. This ensures that the file is read only once.

Figure 5 Caching the Column Names for Better Performance

public CsvFile(string csvFileName)
{
  Filename = csvFileName;
    using (var reader = File.OpenText(Filename))
      ColumnNames = Parse(reader.ReadLine());
        
}
public string Filename { get; private set; }
public string[] ColumnNames { get; private set;}

After rebuilding, we execute the previous command again and find the application much snappier; the duration is now only 2.5 seconds.

Nonetheless, reviewing the data with the fix, you’ll notice that CPU time is still dominant. By again drilling into CPU time and looking at the bottom-up analysis, you can see that Regex.Replace is now the most-expensive method, and that it’s called from EscapeXml and ToValidXmlName. Because EscapeXml is the more-expensive method (330 ms exclusive time), check its source code:

private static string EscapeXml(string str)
{
  str = Regex.Replace(str, "\"", "&quote;");
  str = Regex.Replace(str, "<", "&lt;");
  str = Regex.Replace(str, ">", "&gt;");
  str = Regex.Replace(str, "&", "&amp;");
  return str;
}

EscapeXml is also called within a loop in XmlElementForRow, and thus has the potential to be a bottleneck. Regular expressions are a bit of overkill for these replacements, and using a string Replace method would be more efficient. So replace EscapeXml with the following:

private static string EscapeXml(string str)
{
  str = str.Replace("\"", "&quote;");
  str = str.Replace("<", "&lt;");
  str = str.Replace(">", "&gt;");
  str = str.Replace("&", "&amp;");
  return str;
}

With this transformation, you’ve reduced the overall time to approximately two seconds, with CPU time still dominant. This is acceptable performance—you’ve improved the execution speed almost sevenfold.

As an exercise for the reader, I’ve left a few more performance bugs in the sample program that can be identified using ETW events.

Exploring GC Statistics

PerfMonitor GC statistics supply a quick overview of the memory profile. As you may recall, I strongly recommend memory usage auditing, and the information provided through GC ETW events provides a quick snapshot of any problems with the .NET GC heap. The quick summary view tells you the aggregate GC heap size, the allocation rates and the GC pause times. Selecting the GC Time Analysis link on the PerfMonitor results tab shows you the details of the GCs, when they occurred, how much time they consumed and so forth.

The information lets you determine if you need to dig further into any memory issues using the CLRProfiler or other memory profilers. The article “Profiling the .NET Garbage-Collected Heap” (msdn.microsoft.com/magazine/ee309515) digs into debugging the .NET GC heap using the CLRProfiler.

For this particular program, none of the GC statistics looks worrisome. The allocation rate is high; a good rule of thumb is to have your allocation rate below 10MB/s. However, pause times are very small. High allocation rates show up under CPU time, which mostly means that there are CPU gains to be had—as you found out. However, after the fixes, the allocation rates remain high, and this means there are a lot of allocations happening (can you fix this?). The GC pause-time of a few milliseconds is a testament to the self-tuning and efficient GC that the .NET Framework runtime provides. Thus, the .NET Framework GC is automatically taking care of memory management.

Exploring JIT Statistics

To improve startup time, one of the first items to look into is the time needed for JIT compiling of methods. If the time taken is significant (for example, most of the time used during startup of your app is consumed by JIT compilation), the application might benefit from native image generation (NGen), which eliminates JIT compilation time by precompiling the assembly and saving it on disk. That is, the assembly is JIT-compiled and saved on disk, eliminating the need for JIT compilation for subsequent executions. Before going down the NGen route, though, you may also want to consider 
deferring some of the methods being JIT compiled to a later point in the program so the JIT compilation times don’t affect startup. For more information, please see the article, “The Performance Benefits of NGen” (msdn.microsoft.com/magazine/cc163610).

The sample application CsvToXml.exe didn’t have a significant startup cost, and allowing it to JIT compile all methods every time is fine. The JIT compilation statistics also tell you that the number of methods that were JIT compiled was 17 (suggesting that all the methods called were JIT compiled), and the total JIT compilation time was 23 ms. Neither of these is a performance issue with this application, but for larger applications where JIT compilation time is a factor, using NGen should eliminate any problems. Normally, JIT compilation time becomes a factor when an application starts JIT compiling hundreds or thousands of methods. In such cases, NGen is the solution to eliminate JIT compilation costs.

More guidance on improving startup is available in other MSDN Magazine articles, and the ETW events can help identify and fix bottlenecks. Several other JIT events are available as well, including JIT inline events that can provide insights into why a method wasn’t inlined.

CLR ETW Events in the .NET Framework 4

The CLR team wrote a blog post on tracking down DLL loads and determining whether a particular DLL needs to be loaded during startup. The process of determining whether a DLL load needs to happen during startup becomes simpler with ETW events. By using the ETW Module Load events available in the .NET Framework 4, we know which modules are loaded and why. There are also events for module unloads and so on.

There are several other events in the .NET Framework 4 that make diagnosing your managed application simpler than ever. Figure 6 summarizes these events. All the events that were triggered during execution can be dumped with the PerfMonitor runPrint command. The CLR team has also run down events that allow you to attach and detach ETW profiling, and the team plans to keep adding more ETW events to make the process of debugging managed applications simpler in future releases.

Figure 6 ETW Events in the .NET Framework 4

Event Category Name Description
Runtime Information ETW Event Captures information about the runtime, including the SKU, version number, the manner in which the runtime was activated, the command-line parameters it was started with, the GUID (if applicable) and other relevant information.
Exception Thrown ETW Event Captures information about exceptions that are thrown.
Contention ETW Events Captures information about contention for monitor locks or native locks that the runtime uses.
Thread Pool ETW Events Captures information about worker thread pools and I/O thread pools.
Loader ETW Events Captures information about loading and unloading application domains, assemblies and modules.
Method ETW Events Captures information about CLR methods for symbol resolution.
GC ETW Events Captures information pertaining to GC.
JIT Tracing ETW Events Captures information about JIT inlining and tail calls.
Interop ETW Events Captures information about Microsoft intermediate language (MSIL) stub generation and caching.
Application Domain Resource Monitoring (ARM) ETW Events Captures detailed diagnostic information about the state of an application domain.
Security ETW Events Captures information about strong name and Authenticode verification.
Stack ETW Event Captures information that’s used with other events to generate stack traces after an event is raised.

You’ll find two files with the suffix PerfMonitorOutput in the execution directory; these are the ETW log files. You’ll also see files with the suffix kernel, signifying they contain the OS events. The data collected by PerfMonitor is the same data that XPerf uses, so you can use PerfMonitor to simplify data collection and simple reporting and XPerf for more advanced analysis of the same data. The PerfMonitor merge command converts the ETW files to an XPerf-readable format.

Wrapping Up

Performance investigation using ETW is simple, yet powerful. Various free, low-overhead ETW-based tools are available that allow debugging of managed code efficiently. I’ve just skimmed the surface of the ETW events that are available in the .NET Framework runtime. My aim was to get you started debugging your managed application using ETW events and tooling. Downloading PerfMonitor and using the MSDN documentation of ETW events in CLR, along with the CLR Perf Blog, will jump-start your performance investigations of your managed applications.

A special thanks to Vance Morrison, partner architect for CLR Performance, for his guidance and assistance with this article.


Subramanian Ramaswamyis the program manager for CLR Performance at Microsoft. He holds a Ph.D. in Electrical and Computer Engineering from the Georgia Institute of Technology.