{ End Bracket }

Creating a Custom Metrics Tool

Stephen Toub

Code download available at: EndBracket0504.exe (137 KB)

Metrics play an important role in our lives. Even if we don't realize it or characterize it as such, many daily activities have the potential to be quantified to some degree. So it's not surprising that metrics play an even greater role in the workplace, where there are goals and a bottom line and where much of a day's activity can be summarized in numbers. The problem is often figuring out exactly how to quantify one's progress and success.

Editing MSDN® Magazine is no different. Some articles are well written and require little work on our part; others require days of detailed attention. While off-the-cuff estimates often suffice, it's important to be able to review the edits that an article has undergone and to somehow be able to quantify the changes. This helps us determine how much effort was put into a particular piece. That information can be used to critique past decisions and to plan for the future, so a tool that could analyze articles and entire issues to quantify the effort involved would be quite useful.

88% Unadulterated!

We use Visual SourceSafe® (VSS) as a document repository. When an editor does an edit pass on an article, she checks it out, edits it, and checks it back in. Thus, each edit pass results in a separate version of that document in VSS. As a result, it's trivial to retrieve a version of an article as it existed before and after a particular edit pass.

With this in mind, I created a tool that uses the VSS client COM libraries to extract all versions of an article in order to compare each to the original document. Determining the metamorphosis of a document over its lifetime, as well as how much a particular edit pass influenced that evolution, is then just a click away. The tricky part is coming up with a valid metric: what does it mean for two documents to be similar?
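As a rough illustration (not the tool's actual code), here's how the VSS automation model can be used to pull each stored version of a document to disk, assuming a reference to SourceSafeTypeLib; the database path, credentials, and item spec below are placeholders:

// requires a reference to SourceSafeTypeLib
VSSDatabase db = new VSSDatabaseClass();
db.Open(@"\\server\vss\srcsafe.ini", "editor", "password"); // placeholder credentials
IVSSItem item = db.get_VSSItem("$/Articles/Article.doc", false);
foreach(IVSSVersion ver in item.get_Versions(0))
{
    // retrieve this historical version to its own local file
    string local = @"C:\Temp\Article_v" + ver.VersionNumber + ".doc";
    ver.VSSItem.Get(ref local, (int)VSSFlags.VSSFLAG_FORCEDIRNO);
}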

In order to compare Microsoft® Word documents, first I need to obtain the text from each. Using the primary interop assembly (PIA) for Word, I create an instance of the Word ApplicationClass and open each document in turn. The unformatted text from the resulting Document object can then be obtained quickly with code such as the following:

Document doc = ...; // get the document
doc.ActiveWindow.Selection.WholeStory(); // select all
string text = doc.ActiveWindow.Selection.Range.Text;
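For context, here's a sketch of the plumbing around that snippet, assuming the Word 2003 PIA (Microsoft.Office.Interop.Word), whose Documents.Open method takes a long list of optional ref parameters; the path and the cleanup choices are placeholders rather than the tool's actual code:

object missing = System.Reflection.Missing.Value;
object path = @"C:\Temp\Article_v1.doc"; // placeholder path
ApplicationClass app = new ApplicationClass(); // Word stays hidden by default
Document doc = app.Documents.Open(ref path, ref missing, ref missing,
    ref missing, ref missing, ref missing, ref missing, ref missing,
    ref missing, ref missing, ref missing, ref missing, ref missing,
    ref missing, ref missing, ref missing);
doc.ActiveWindow.Selection.WholeStory(); // select all
string text = doc.ActiveWindow.Selection.Range.Text;
object noSave = WdSaveOptions.wdDoNotSaveChanges;
doc.Close(ref noSave, ref missing, ref missing); // discard any changes
app.Quit(ref noSave, ref missing, ref missing);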

Given the text for each document, how do I determine their similarity? Document similarity is an emerging field in computer science. Incorporating some of the field's ideas on n-gram-based metrics (an n-gram is a phrase made up of n grams, or words; for n=2, "the quick brown fox" yields the 2-grams "the quick," "quick brown," and "brown fox"), my algorithm for comparing two documents works in two phases.

I first normalize each document's string, optionally removing punctuation and optionally lowercasing all text, and then break it up into an array of its constituent words. From this array, I create a Hashtable of every n-gram in the string, where the key into the table is the n-gram phrase and the value of that entry is the number of times that phrase occurs in the string:

Hashtable CreateTable(string[] words, int n)
{
    Hashtable table = new Hashtable(words.Length);
    for(int i=0; i<words.Length - n + 1; ++i)
    {
        // build the n-word phrase starting at position i
        string phrase = string.Join(" ", words, i, n);
        object val = table[phrase];
        table[phrase] = (val==null) ? 1 : 1 + (int)val; // increment its count
    }
    return table;
}
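The normalization and word-splitting step isn't shown above; here's a minimal sketch, assuming a simple regex-based approach (the actual tool's options may be more involved):

// requires System.Text.RegularExpressions
string[] Normalize(string text, bool removePunctuation, bool lowercase)
{
    if (removePunctuation)
        text = Regex.Replace(text, @"[^\w\s]", ""); // strip punctuation characters
    if (lowercase)
        text = text.ToLower();
    return Regex.Split(text.Trim(), @"\s+"); // split on runs of whitespace
}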

The comparison operation is straightforward. I iterate over all of the phrases in the first Hashtable and look for each in the second Hashtable. If the phrase exists, this particular phrase from the first document exists in the second document, and the counts stored in each entry are used to determine whether the phrase was used the same number of times in each document. In this fashion, I build up a number that represents how many of the phrases from the first document exist in the second. This process is run again with the documents swapped, since phrases that appear only in the second document would otherwise never be examined (after all, if an editor adds a new paragraph, that constitutes a major change to the piece). At this point, I have a tally of hits of phrases between the two documents and can simply divide by the total number of phrases in both documents, arriving at a percentage. This percentage is meant to approximate the similarity of the final version of the document to the original.
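As a concrete illustration, here's one way the comparison just described might look in code; this is my sketch, with matched occurrences capped at the smaller of the two counts, not necessarily the downloadable tool's exact scoring:

double Compare(Hashtable first, Hashtable second)
{
    // hits in both directions, divided by the total phrases in both documents
    return (double)(Hits(first, second) + Hits(second, first)) /
        (Total(first) + Total(second));
}

int Hits(Hashtable from, Hashtable to)
{
    int hits = 0;
    foreach(DictionaryEntry e in from)
    {
        object val = to[e.Key];
        if (val != null) // the phrase exists in the other document
            hits += Math.Min((int)e.Value, (int)val);
    }
    return hits;
}

int Total(Hashtable table)
{
    int total = 0;
    foreach(int count in table.Values) total += count; // sum phrase occurrences
    return total;
}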

Integrating this code into my GUI front end that retrieves documents from VSS, I was able to give my team a utility that provides these statistics on demand, both for individual articles and for entire issues. The code I use to compare two documents is available for download from the link at the top of this article. So, for all of you out there who are as obsessive about numbers as I am, enjoy!

Stephen Toub is the Technical Editor for MSDN Magazine.