{ End Bracket }
Creating a Custom Metrics Tool
Stephen Toub

Code download available at: EndBracket0504.exe (137 KB)

Metrics play an important role in our lives. Even if we don't realize it or characterize it as such, many daily activities have the potential to be quantified to some degree. So it's not surprising that metrics play an even greater role in the workplace, where there are goals and a bottom line and where much of a day's activity can be summarized in numbers. The problem is often figuring out exactly how to quantify one's progress and success.
Editing MSDN® Magazine is no different. Some articles are well written and require little work on our part; others require days of detailed attention. While off-the-cuff estimates often suffice, it's important to be able to review the edits that an article has undergone and to quantify the changes. This helps us determine how much effort was put into a particular piece. That information can be used to critique past decisions and to plan for the future, so a tool that could analyze articles and entire issues to quantify the effort involved would be quite useful.
 88% Unadulterated! 
We use Visual SourceSafe® (VSS) as a document repository. When an editor does an edit pass on an article, she checks it out, edits it, and checks it back in. Thus, each edit pass results in a separate version of that document in VSS. As a result, it's trivial to retrieve a version of an article as it existed before and after a particular edit pass.
With this in mind, I created a tool that uses the VSS client COM libraries to extract all versions of an article in order to compare each to the original document. From there, it takes just a click to see how a document changed over its lifetime and how much a particular edit pass contributed to that evolution. The tricky part is coming up with a valid metric: what does it mean for two documents to be similar?
In order to compare Microsoft® Word documents, first I need to obtain the text from each. Using the primary interop assembly (PIA) for Word, I create an instance of the Word ApplicationClass and open each document in turn. The unformatted text from the resulting Document object can then be obtained quickly with code such as the following:
Document doc = ...;                           // get the document
doc.ActiveWindow.Selection.WholeStory();      // select all
string text = doc.ActiveWindow.Selection.Range.Text;
Given the text for each document, how do I determine their similarity? Document similarity is an emerging field in computer science. Incorporating some of the field's ideas on n-gram-based metrics (an n-gram is a phrase made up of n grams, or words), my algorithm for comparing two documents works in two phases.
I first normalize each document's string, optionally removing punctuation and optionally lowercasing all text. The normalized string is then broken up into an array of its constituent words. From this array, I create a Hashtable of every n-gram in the string, where the key into the table is the n-gram phrase and the value of that entry is the number of times that phrase occurs in the string:
Hashtable CreateTable(string[] words, int n) {
    // Maps each n-gram phrase to the number of times it occurs.
    Hashtable table = new Hashtable(words.Length);
    // Slide a window of n words across the array.
    for (int i = 0; i < words.Length - n + 1; ++i) {
        string phrase = string.Join(" ", words, i, n);
        object val = table[phrase];
        table[phrase] = (val == null) ? 1 : 1 + (int)val;
    }
    return table;
}
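The normalization step that produces the words array is not shown above. A minimal sketch of how it might look follows; the names TextNormalizer and ToWords are hypothetical, and treating char.IsPunctuation as the definition of punctuation and splitting on whitespace are assumptions rather than the tool's actual rules.

```csharp
using System;
using System.Linq;

static class TextNormalizer
{
    // Hypothetical helper: turns raw document text into an array of words.
    // Both clean-up steps are optional, as described in the article.
    public static string[] ToWords(string text, bool stripPunctuation, bool lowercase)
    {
        if (lowercase)
            text = text.ToLowerInvariant();

        if (stripPunctuation)
            // Assumption: drop every character that .NET classifies as punctuation.
            text = new string(text.Where(c => !char.IsPunctuation(c)).ToArray());

        // A null separator array splits on any whitespace; empty entries are discarded.
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

With both options on, ToWords("The quick, brown fox.", true, true) yields the four words the, quick, brown, and fox, which can then be passed to CreateTable. Note that dropping punctuation outright joins hyphenated and contracted words ("don't" becomes "dont"); replacing punctuation with spaces is an equally reasonable choice.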
The comparison operation is straightforward. I iterate over all of the phrases in the first Hashtable and look for each in the second Hashtable. If the phrase exists, this particular phrase from the first document exists in the second document, and the counts stored in each entry are used to determine whether the phrase was used the same number of times in each document. In this fashion, I build up a number that represents how many of the phrases from the first document exist in the second. This process is run again, but with the documents swapped (after all, if an editor adds a new paragraph, that constitutes a major change to the piece). At this point, I have a tally of hits of phrases between the two documents and can simply divide by the total number of phrases in both documents, arriving at a percentage. This percentage is meant to approximate the similarity of the final version of the document to the original.
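The comparison just described can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the download: the names NGramSimilarity, CountNGrams, CountHits, and Similarity are hypothetical, a typed Dictionary stands in for the Hashtable, and crediting the smaller of the two counts for a shared phrase is one plausible reading of how the counts are used.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class NGramSimilarity
{
    // Counts every n-gram in words (same idea as CreateTable, with a typed dictionary).
    public static Dictionary<string, int> CountNGrams(string[] words, int n)
    {
        var table = new Dictionary<string, int>();
        for (int i = 0; i + n <= words.Length; i++)
        {
            string phrase = string.Join(" ", words, i, n);
            table.TryGetValue(phrase, out int count); // count stays 0 when the phrase is new
            table[phrase] = count + 1;
        }
        return table;
    }

    // Tallies phrases of 'from' that also occur in 'to'.
    // Assumption: a shared phrase is credited with the smaller of its two counts.
    static int CountHits(Dictionary<string, int> from, Dictionary<string, int> to)
    {
        int hits = 0;
        foreach (var entry in from)
            if (to.TryGetValue(entry.Key, out int other))
                hits += Math.Min(entry.Value, other);
        return hits;
    }

    // Returns a similarity between 0.0 and 1.0 for n >= 1.
    public static double Similarity(string[] a, string[] b, int n)
    {
        var tableA = CountNGrams(a, n);
        var tableB = CountNGrams(b, n);
        int total = tableA.Values.Sum() + tableB.Values.Sum();
        if (total == 0)
            return 1.0; // assumption: two texts too short for any n-gram count as identical

        // Run the comparison in both directions, then divide by all phrases in both documents.
        int hits = CountHits(tableA, tableB) + CountHits(tableB, tableA);
        return (double)hits / total;
    }
}
```

For example, with n = 2 the word arrays for "the quick brown fox" and "the quick red fox" each contain three phrases and share only "the quick", so Similarity returns 2/6, or about 33 percent. Two identical arrays return 1.0.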
By integrating this code into the GUI front end that retrieves documents from VSS, I was able to give my team a utility that produces these statistics on demand for both individual articles and entire issues. The code I use to compare two documents is available for download from the link at the top of this article. So, for all of you out there who are as obsessive about numbers as I am, enjoy!

Stephen Toub is the Technical Editor for MSDN Magazine.
