{ End Bracket }
Creating a Custom Metrics Tool
Stephen Toub

Code download available at: EndBracket0504.exe (137 KB)

Metrics play an important role in our lives. Even if we don't realize it or characterize it as such, many daily activities have the potential to be quantified to some degree. So it's not surprising that metrics play an even greater role in the workplace, where there are goals and a bottom line and where much of a day's activity can be summarized in numbers. The problem is often figuring out exactly how to quantify one's progress and success.
Editing MSDN® Magazine is no different. Some articles are well written and require little work on our part; others require days of detailed attention. While off-the-cuff estimates often suffice, it's important to be able to review the edits that an article has undergone and to quantify the changes. This helps us determine how much effort was put into a particular piece. That information can be used to critique past decisions and to plan for the future, so a tool that could analyze articles and entire issues to quantify the effort involved would be quite useful.
 88% Unadulterated! 
We use Visual SourceSafe® (VSS) as a document repository. When an editor does an edit pass on an article, she checks it out, edits it, and checks it back in. Thus, each edit pass results in a separate version of that document in VSS. As a result, it's trivial to retrieve a version of an article as it existed before and after a particular edit pass.
With this in mind, I created a tool that uses the VSS client COM libraries to extract all versions of an article in order to compare each to the original document. From there, it takes just a click to see how a document has changed over its lifetime and how much a particular edit pass contributed to that evolution. The tricky part is coming up with a valid metric: what does it mean for two documents to be similar?
In order to compare Microsoft® Word documents, first I need to obtain the text from each. Using the primary interop assembly (PIA) for Word, I create an instance of the Word ApplicationClass and open each document in turn. The unformatted text from the resulting Document object can then be obtained quickly with code such as the following:
Document doc = ...;                           // get the document
doc.ActiveWindow.Selection.WholeStory();      // select all
string text = doc.ActiveWindow.Selection.Range.Text;
Given the text for each document, how do I determine their similarity? Document similarity is an emerging field in computer science. Incorporating some of the field's ideas on n-gram-based metrics (an n-gram is a phrase made up of n grams, or words), my algorithm for comparing two documents works in two phases.
I first normalize each document's string, which optionally includes removing punctuation and lowercasing all text. The normalized string is then broken up into an array of its constituent words. From this array, I create a Hashtable of every n-gram in the string, where the key into the table is the n-gram phrase and the value of that entry is the number of times that phrase occurs in the string:
Hashtable CreateTable(string[] words, int n) {
    Hashtable table = new Hashtable(words.Length);
    // There are words.Length - n + 1 windows of n consecutive words
    for(int i=0; i<words.Length - n + 1; ++i) {
        string phrase = string.Join(" ", words, i, n);
        object val = table[phrase];          // null if the phrase is new
        table[phrase] = (val==null) ? 1 : 1 + (int)val;
    }
    return table;
}
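The normalization step that produces the words array is described above only in prose, so here is a minimal sketch of one way it might look. The TextNormalizer class, the ToWords method, and its two flags are hypothetical names, not taken from the downloadable code, and the choice to delete punctuation outright (so "don't" becomes "dont") rather than replace it with a space is an assumption.

```csharp
using System;
using System.Text;

public static class TextNormalizer
{
    // Hypothetical helper: turns raw document text into the word array
    // that a method like CreateTable expects.
    public static string[] ToWords(string text, bool removePunctuation, bool lowercase)
    {
        StringBuilder sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            // Assumption: punctuation is dropped, not replaced by a space
            if (removePunctuation && char.IsPunctuation(c)) continue;
            sb.Append(lowercase ? char.ToLowerInvariant(c) : c);
        }
        // A null separator array splits on white space; empty entries are discarded
        return sb.ToString().Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

With both options turned on, the text "Hello, World!  Hello." yields the three words hello, world, and hello, which can then be passed to CreateTable along with the chosen n.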
The comparison operation is straightforward. I iterate over all of the phrases in the first Hashtable and look for each in the second Hashtable. When a phrase is found in both, the counts stored in the two entries are used to determine whether it was used the same number of times in each document. In this fashion, I build up a number that represents how many of the phrases from the first document exist in the second. The process is then run again with the documents swapped (after all, if an editor adds a new paragraph, that constitutes a major change to the piece). At this point, I have a tally of phrase hits between the two documents and can simply divide it by the total number of phrases in both documents to arrive at a percentage. This percentage approximates the similarity of the final version of the document to the original.
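To make the comparison phase concrete, here is a rough sketch under stated assumptions; it is not the code from the download. The NGramSimilarity class and its method names are hypothetical, it uses a typed Dictionary in place of Hashtable, and it assumes that a phrase found in both documents scores the smaller of its two occurrence counts, which is only one plausible way to use the counts.

```csharp
using System;
using System.Collections.Generic;

public static class NGramSimilarity
{
    // Same idea as CreateTable, but with a typed dictionary.
    public static Dictionary<string, int> CountNGrams(string[] words, int n)
    {
        if (n < 1) throw new ArgumentOutOfRangeException("n");
        var table = new Dictionary<string, int>();
        for (int i = 0; i < words.Length - n + 1; ++i)
        {
            string phrase = string.Join(" ", words, i, n);
            table.TryGetValue(phrase, out int count);   // count is 0 for a new phrase
            table[phrase] = count + 1;
        }
        return table;
    }

    // Tally of phrases from one table that also appear in the other.
    // Assumption: a shared phrase scores the smaller of its two counts.
    static int Hits(Dictionary<string, int> from, Dictionary<string, int> to)
    {
        int hits = 0;
        foreach (KeyValuePair<string, int> entry in from)
        {
            if (to.TryGetValue(entry.Key, out int other))
                hits += Math.Min(entry.Value, other);
        }
        return hits;
    }

    // Total number of n-gram occurrences recorded in a table.
    static int Total(Dictionary<string, int> table)
    {
        int total = 0;
        foreach (int count in table.Values) total += count;
        return total;
    }

    // Returns a value between 0.0 (nothing shared) and 1.0 (same n-grams, same counts).
    public static double Similarity(string[] first, string[] second, int n)
    {
        Dictionary<string, int> a = CountNGrams(first, n);
        Dictionary<string, int> b = CountNGrams(second, n);
        int total = Total(a) + Total(b);
        if (total == 0) return 0.0;   // assumption: no n-grams means nothing to compare
        // Run the lookup in both directions, then divide by all phrases in both documents
        return (double)(Hits(a, b) + Hits(b, a)) / total;
    }
}
```

Under these assumptions, comparing "the quick brown fox" with "the quick red fox" using two-word phrases finds one shared phrase out of three in each direction, so the result is 2 divided by 6, or about 33 percent. Multiplying the return value by 100 gives a percentage of the kind described above.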
By integrating this code into the GUI front end that retrieves documents from VSS, I was able to give my team a utility that produces these statistics on demand for both individual articles and entire issues. The code I use to compare two documents is available for download from the link at the top of this article. So, for all of you out there who are as obsessive about numbers as I am, enjoy!

Stephen Toub is the Technical Editor for MSDN Magazine.
