Test Run
Competitive Analysis Using MAGIQ
Dr. James McCaffrey and Nasa Koski

Competitive analysis, an important type of software testing, involves comparing your software system under test with competitor systems. Even if your system does not have explicit competitors, you can consider previous builds of your own system as implied competitors. The goal is to be able to compare the overall quality of your software system with similar systems. The key here is overall quality, which is not easy to measure. You must take into account many different attributes such as functionality, security, and usability. You must determine how to rate each attribute of each system, and you must find a way to combine all the attribute ratings into a single quality metric for each system.

In this month's column, I'm joined by Nasa Koski, a test lead in the MSN® division at Microsoft, and together we'll attempt to tackle the problem, presenting a methodology that seeks to translate data obtained through both measurement and perception into more easily comparable numeric statistics or metrics. We'll demonstrate and explain a powerful but very simple method called the MultiAttribute Global Inference of Quality (MAGIQ). This technique will enable you to quickly calculate the overall quality of your software system under test, thus allowing you to perform competitive analysis.

A Little Bit of MAGIQ

The MAGIQ technique is very general and can be applied to virtually any type of system including traditional applications, Web applications, class libraries, and so forth. Suppose, for example, you're developing a software system called System A that performs intranet searches. Imagine you want to rate it against four competitors' systems: System W, System X, System Y, and System Z. Using the MAGIQ procedure you will be able to compute metrics like those shown in Figure 1.

Figure 1 MAGIQ Competitive Analysis Metrics

           Performance (0.2500)                 Accuracy (0.7500)
           Startup    Search     Save       Top Result   First Screen   Overall
           (0.2778)   (0.6111)   (0.1111)   (0.7500)     (0.2500)
System A   0.4567     0.2567     0.4567     0.2567       0.0900         0.2449
System W   0.0900     0.0400     0.0900     0.0400       0.1567         0.0667
System X   0.2567     0.1567     0.0400     0.4567       0.2567         0.3479
System Y   0.1567     0.4567     0.2567     0.0900       0.0400         0.1459
System Z   0.0400     0.0900     0.1567     0.1567       0.4567         0.1947

Figure 1 shows the comparison of the five software systems based on performance and accuracy attributes. Both of these attributes have subattributes. The overall quality of System X is best with a quality metric of 0.3479 and the quality of System W is worst at 0.0667. Based on these results we conclude that our product, System A, is currently the second best of the five systems.

In the sections that follow we will set up the problem that leads to the results in Figure 1, detail how to use the MAGIQ process, describe how to interpret the results of the technique, and explain how you can adapt this procedure to meet your own needs. We think you'll find the ability to quickly conduct competitive analysis an important addition to your software testing and project management skill sets.

Setting Up the Problem

In practice we use the MAGIQ technique most often to compare relatively similar systems in order to gauge relative product quality, but we've also used it to compare different builds of a particular system in order to gauge quality progress. In order to compare the overall quality of a set of software systems, we have to take into account many attributes of the systems. For the purposes of the hypothetical intranet search system example presented here, we are considering just performance and accuracy. In a realistic scenario, we would also consider attributes such as security, functionality, usability, and so forth. The MAGIQ technique is very adaptable and can handle any number of attributes.

Next we decide to split the performance attribute into three subattributes: startup performance, core search speed, and save performance. We also decide to split the accuracy attribute into two subattributes: top-result accuracy (how good the number one search result is) and first-screen accuracy (how good the first 10 or so search results that fill the initial results screen are). As we'll demonstrate, MAGIQ can accommodate as many levels of attribute decomposition as you want.

Our goal is to produce a single numeric value that represents the overall quality for each software system. After deciding which attributes you are going to use as the quality criteria, you then have to determine how to compare each system on each attribute, how to assign values for that information, how to combine all that data into a meaningful overall quality metric, and how to interpret the results.

Figure 2 shows the example expressed as a diagram. The diagram form might seem strange compared to, say, a table, but as we'll explain later this form makes some of the MAGIQ calculations easier to understand. The diagram form tends to be a bit awkward for software engineers to interpret because its hierarchy combines different kinds of elements (problem, attributes, systems), unlike the more familiar hierarchies that relate similar elements (directories and files). In the end, any form of problem setup is okay as long as it clearly conveys what the problem is and what attributes you are using as the basis for comparison in the MAGIQ analysis.

Figure 2 Hierarchical Diagram Representation of Problem

A problem diagram or table is a convenient way to clearly communicate exactly what your MAGIQ quality metrics are, especially to people external to your group. An additional advantage is that it forces you to thoroughly analyze exactly which factors you are going to use as comparison criteria. This process often uncovers issues you may have overlooked.

Determining Weights of the Attributes

The next step is to determine the relative weights of each of the comparison attributes. In most situations, certain attributes are more important than others. For example, overall system performance may be more important to us than overall accuracy or vice versa. The MAGIQ technique uses an interesting concept called rank order centroids (ROCs).
ROCs are a way to convert ranks (such as first, second, third) into ratings or weights, which are numeric values (such as 0.6, 0.3, 0.1). For this example, we begin by looking at our top-level attributes: performance and accuracy. We rank them from most important to least important. Here we determine that overall accuracy is most important and overall performance is second. Then we compute the ROC for each of these two high-level attributes:
Accuracy: w1 = (1 + 1/2) / 2 = 0.7500
Performance: w2 = (0 + 1/2) / 2 = 0.2500
We will explain the calculations in a moment but the result is that accuracy is assigned a weight of 0.7500 and performance is assigned a weight of 0.2500. Next we compute the ROCs for each subattribute group. For the three performance subattributes, we decide that core search performance is most important, startup performance is second, and save performance is third. So the weights are as follows:
Search performance: w1 = (1 + 1/2 + 1/3) / 3 = 0.6111
Startup performance: w2 = (0 + 1/2 + 1/3) / 3 = 0.2778
Save performance: w3 = (0 + 0 + 1/3) / 3 = 0.1111
Then for the two accuracy subattributes, we decide that top-result accuracy is most important and that first-screen accuracy is second in importance. So those weights are calculated:
Top-result accuracy: w1 = (1 + 1/2) / 2 = 0.7500
First-screen accuracy: w2 = (0 + 1/2) / 2 = 0.2500
Now, let's see how ROCs are calculated. The calculation pattern is pretty easy to see, especially if we show you what it looks like for a set of four items:
w1 = (1 + 1/2 + 1/3 + 1/4) / 4 = 0.5208
w2 = (0 + 1/2 + 1/3 + 1/4) / 4 = 0.2708
w3 = (0 + 0 + 1/3 + 1/4) / 4 = 0.1458
w4 = (0 + 0 + 0 + 1/4) / 4 = 0.0625
Notice that the ROC values add up to 1.0 (subject to rounding error). Expressed in sigma notation, if N is the number of attributes then the weight of the kth attribute is:

$$ w_k = \frac{1}{N} \sum_{i=k}^{N} \frac{1}{i} $$

Sigma Notation of the Weight of the kth Attribute

This is easy to compute. For example, the code for a simple C# implementation without any error-checking is shown in Figure 3.

The Math Behind Rank Order Centroids
Although a deep explanation of the mathematics behind rank order centroids is outside the scope of this column, we’ll briefly describe the rationale. The idea is to convert ranks (first, second, third, fourth) into values that are normalized on a 0.0 to 1.0 interval scale. An obvious way to try this is to assume each rank is distributed evenly within the unit interval. So, first => 0.80, second => 0.60, third => 0.40, and fourth => 0.20. But ranks are really a form of rate data—1st/4, 2nd/4, and so forth. And you may remember from an elementary statistics class that rate data is best handled using harmonic techniques. For example the average of 30 mph and 60 mph over a fixed distance is not (30 + 60) / 2 = 45 mph, but rather 2 / (1/30 + 1/60) = 40 mph. Notice that calculations for ROCs are similar. The concept of a centroid of a feasible space (simplex) is not new. The term "rank order centroid" was coined by F. H. Barron and B. E. Barrett, who also argued for its use in multiattribute decision problems.
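As a quick check on the observation that the ROC weights always add up to 1.0, the sum can be rearranged by swapping the order of summation. This is only a sketch; it assumes the weight of the kth-ranked item out of N is the sum of 1/i for i from k to N, divided by N, which is the pattern the worked examples follow:

```latex
\sum_{k=1}^{N} w_k
  = \frac{1}{N} \sum_{k=1}^{N} \sum_{i=k}^{N} \frac{1}{i}
  = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{i} \frac{1}{i}
  = \frac{1}{N} \sum_{i=1}^{N} i \cdot \frac{1}{i}
  = \frac{1}{N} \cdot N
  = 1
```

Each term 1/i appears in exactly i of the weights (w1 through wi), which is why the inner sum collapses to 1.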
Figure 3 Generate Rank Order Centroids
using System;

class Program
{
    static void Main(string[] args)
    {
        try
        {
            Console.WriteLine("\nGenerating rank order centroids\n");
            int N = 5; // number of items being ranked
            Console.WriteLine("N = " + N);
            Console.WriteLine("===========");
            // Print the weight for each rank, from first (k = 1) to last (k = N).
            for (int k = 1; k <= N; ++k)
            {
                Console.WriteLine("w" + k + " = " +
                    roc(N, k).ToString("0.0000"));
            }
            Console.WriteLine("\nDone");
        }
        catch (Exception ex)
        {
            Console.WriteLine("Fatal error: " + ex.Message);
        }
        Console.ReadLine(); // keep the console window open
    }

    // Rank order centroid weight of the kth-ranked item out of N:
    // (1/k + 1/(k+1) + ... + 1/N) / N. No error checking is performed.
    static double roc(int N, int k)
    {
        double result = 0.0;
        for (int i = k; i <= N; ++i)
        {
            result += (1.0 / i);
        }
        return result / N;
    }
}
Running the code in Figure 3 for N = 5 results in the following rank order centroids:
N = 5

w1 = 0.4567
w2 = 0.2567
w3 = 0.1567
w4 = 0.0900
w5 = 0.0400
Note that for any given value of N, there will be N weights and that the values of these weights are always the same. Therefore, an alternative to computing the weights for a given N each time you perform a MAGIQ analysis is to simply construct a set of tables for various values of N. For example:
N = 2

w1 = 0.7500
w2 = 0.2500
N = 3

w1 = 0.6111
w2 = 0.2778
w3 = 0.1111
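Tables like these could also be generated once and cached rather than typed in by hand. The following is a minimal sketch under that assumption; the names RocTables, Build, and Roc are illustrative and not part of the original column:

```csharp
using System;
using System.Globalization;

static class RocTables
{
    // ROC weight of the kth-ranked item (1-based) out of n items.
    public static double Roc(int n, int k)
    {
        double sum = 0.0;
        for (int i = k; i <= n; ++i)
            sum += 1.0 / i;
        return sum / n;
    }

    // Builds table[n][k - 1] for n = 1..maxN. Row 0 is an unused empty row
    // so that the row index equals the number of ranked items.
    public static double[][] Build(int maxN)
    {
        var table = new double[maxN + 1][];
        table[0] = new double[0];
        for (int n = 1; n <= maxN; ++n)
        {
            table[n] = new double[n];
            for (int k = 1; k <= n; ++k)
                table[n][k - 1] = Roc(n, k);
        }
        return table;
    }
}

class Program
{
    static void Main()
    {
        double[][] table = RocTables.Build(5);
        // Look up the weight of the 2nd-ranked item out of 3.
        Console.WriteLine(
            table[3][1].ToString("0.0000", CultureInfo.InvariantCulture)); // prints 0.2778
    }
}
```

Indexing the table by N first keeps each lookup a plain array access, at the cost of one unused row.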
Once you have these tables, you can simply do a lookup when needed. This approach is particularly useful if you distribute the responsibility for doing competitive analysis using MAGIQ among several members of your team. For more on the calculations, see the sidebar "The Math Behind Rank Order Centroids."

Compare Each System on Each Attribute

The next step is to compare each system based on each of the lowest level comparison attributes. In this example that means we need to rank System A against System W against System X against System Y against System Z on each of the five comparison attributes: startup performance, core search performance, save performance, top-result accuracy, and first-screen accuracy. This process is exactly the same as the attribute comparison process. We rank the systems on each subattribute, then convert the ranks into priority weights using rank order centroids. For startup performance, suppose we determine that our System A is best, System X is ranked second, System Y third, System W fourth, and System Z fifth. Then the ROCs for startup performance are as follows:
System A: 1st: w1 = (1 + 1/2 + 1/3 + 1/4 + 1/5) / 5 = 0.4567
System X: 2nd: w2 = (0 + 1/2 + 1/3 + 1/4 + 1/5) / 5 = 0.2567
System Y: 3rd: w3 = (0 + 0 + 1/3 + 1/4 + 1/5) / 5 = 0.1567
System W: 4th: w4 = (0 + 0 + 0 + 1/4 + 1/5) / 5 = 0.0900
System Z: 5th: w5 = (0 + 0 + 0 + 0 + 1/5) / 5 = 0.0400
For search performance, we determine that the systems rank, from best to worst, as Y, A, X, Z, W. For save performance, the rank is A, Y, Z, W, X. For top-result accuracy we determine that the systems rank X, A, Z, Y, W. And for first-screen accuracy, we rank the systems as Z, X, W, A, Y. These rankings are summarized in the table in Figure 4. Using these ranks we compute the relative weights of each system using ROCs as already described.

Figure 4 System Rankings

       Startup   Search   Save   Top   Screen
1st    A         Y        A      X     Z
2nd    X         A        Y      A     X
3rd    Y         X        Z      Z     W
4th    W         Z        W      Y     A
5th    Z         W        X      W     Y

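Converting the Figure 4 rankings into per-system ratings is mechanical, so it can be scripted. The sketch below is one possible way to do it; SystemRatings and FromRanking are hypothetical names introduced for illustration, and the rankings are copied from Figure 4:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class SystemRatings
{
    // ROC weight of the kth-ranked item (1-based) out of n items.
    public static double Roc(int n, int k)
    {
        double sum = 0.0;
        for (int i = k; i <= n; ++i)
            sum += 1.0 / i;
        return sum / n;
    }

    // Maps a ranking (best system first) to a rating per system name.
    public static Dictionary<string, double> FromRanking(string[] bestToWorst)
    {
        var ratings = new Dictionary<string, double>();
        int n = bestToWorst.Length;
        for (int k = 1; k <= n; ++k)
            ratings[bestToWorst[k - 1]] = Roc(n, k);
        return ratings;
    }
}

class Program
{
    static void Main()
    {
        // Rankings from Figure 4, best to worst, one row per attribute.
        string[] attributes = { "Startup", "Search", "Save", "Top", "Screen" };
        string[][] rankings =
        {
            new[] { "A", "X", "Y", "W", "Z" },
            new[] { "Y", "A", "X", "Z", "W" },
            new[] { "A", "Y", "Z", "W", "X" },
            new[] { "X", "A", "Z", "Y", "W" },
            new[] { "Z", "X", "W", "A", "Y" }
        };
        string[] systems = { "A", "W", "X", "Y", "Z" };

        for (int a = 0; a < attributes.Length; ++a)
        {
            Dictionary<string, double> ratings =
                SystemRatings.FromRanking(rankings[a]);
            Console.Write(attributes[a].PadRight(8));
            foreach (string s in systems)
                Console.Write(" " + s + "=" +
                    ratings[s].ToString("0.0000", CultureInfo.InvariantCulture));
            Console.WriteLine();
        }
    }
}
```

Each output row lists one attribute with the rating of every system on it, which is the transpose of the layout used for the intermediate results table.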
Final Evaluation

After setting up the problem, computing the attribute priority ratings, and computing the system comparison ratings, the last step is to aggregate all the intermediate data to produce the final evaluation metrics. It's easiest to explain by example. If we put the intermediate results into a table, the result is Figure 5.

Figure 5 Intermediate Results

           Performance (0.2500)                 Accuracy (0.7500)
           Startup    Search     Save       Top Result   First Screen
           (0.2778)   (0.6111)   (0.1111)   (0.7500)     (0.2500)
System A   0.4567     0.2567     0.4567     0.2567       0.0900
System W   0.0900     0.0400     0.0900     0.0400       0.1567
System X   0.2567     0.1567     0.0400     0.4567       0.2567
System Y   0.1567     0.4567     0.2567     0.0900       0.0400
System Z   0.0400     0.0900     0.1567     0.1567       0.4567

The final overall quality value for System A is:
(.2500)(.2778)(.4567) +
(.2500)(.6111)(.2567) +
(.2500)(.1111)(.4567) +
(.7500)(.7500)(.2567) +
(.7500)(.2500)(.0900) = 0.2449
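The final quality metric for each system is the weighted sum of its attribute ratings. We obtain it by following each path down the attribute tree and multiplying the weights along that path by the system's rating at the end of it. For System A this can be expressed as follows, where each name ending in Rank stands for the ROC value derived from the corresponding rank: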
(PerformanceRank * StartRank * SystemA_StartRank) +
(PerformanceRank * SearchRank * SystemA_SearchRank) +
(PerformanceRank * SaveRank * SystemA_SaveRank) +
(AccuracyRank * TopResultRank * SystemA_TopResultRank) +
(AccuracyRank * FirstScreenRank * SystemA_FirstScreenRank)
This expression looks more complicated than it really is. If you match the numbers in the calculation above with the attribute tree in Figure 2, you'll see exactly how the calculation works. In the same way, the final quality value for System W is:
(.2500)(.2778)(.0900) +
(.2500)(.6111)(.0400) +
(.2500)(.1111)(.0900) +
(.7500)(.7500)(.0400) +
(.7500)(.2500)(.1567) = 0.0667
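The same weighted sum can be computed for all five systems at once. The following is a minimal sketch that uses the rounded four-decimal values from Figure 5 as its input data; the names Magiq, LeafWeights, and Overall are illustrative assumptions, not part of the original column:

```csharp
using System;
using System.Globalization;

static class Magiq
{
    // Weight of each lowest-level attribute: parent weight times subattribute
    // weight. Order: Startup, Search, Save, Top Result, First Screen.
    public static readonly double[] LeafWeights =
    {
        0.2500 * 0.2778, 0.2500 * 0.6111, 0.2500 * 0.1111,
        0.7500 * 0.7500, 0.7500 * 0.2500
    };

    // Overall quality = sum over attributes of (attribute weight * rating).
    public static double Overall(double[] ratings)
    {
        double total = 0.0;
        for (int i = 0; i < LeafWeights.Length; ++i)
            total += LeafWeights[i] * ratings[i];
        return total;
    }
}

class Program
{
    static void Main()
    {
        string[] names = { "A", "W", "X", "Y", "Z" };
        // Per-system ratings from Figure 5, in the same column order
        // as LeafWeights.
        double[][] ratings =
        {
            new[] { 0.4567, 0.2567, 0.4567, 0.2567, 0.0900 },
            new[] { 0.0900, 0.0400, 0.0900, 0.0400, 0.1567 },
            new[] { 0.2567, 0.1567, 0.0400, 0.4567, 0.2567 },
            new[] { 0.1567, 0.4567, 0.2567, 0.0900, 0.0400 },
            new[] { 0.0400, 0.0900, 0.1567, 0.1567, 0.4567 }
        };

        for (int s = 0; s < names.Length; ++s)
        {
            Console.WriteLine("System " + names[s] + ": " +
                Magiq.Overall(ratings[s])
                    .ToString("0.0000", CultureInfo.InvariantCulture));
        }
    }
}
```

With these inputs the sketch prints 0.2449, 0.0667, 0.3479, 0.1459, and 0.1947 for Systems A, W, X, Y, and Z. Feeding in unrounded ROC values instead can shift the last printed digit, because the inputs here are themselves rounded.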
The final quality metrics for Systems X, Y, and Z are calculated similarly, producing the results shown in Figure 1. Notice that the structure of the table in Figure 5 mirrors the diagram form of the problem statement shown in Figure 2.

At this point we can interpret the quality of the five systems. Because the MAGIQ quality values sum to 1.0, you can compare the systems by looking at their overall quality metrics. System X is clearly best at 0.3479. System W is clearly worst at 0.0667. System A is second best, followed by System Z and then System Y. The four-decimal precision of the data is somewhat deceptive. Because the original ranking input data is so crude, our experience suggests you should look at only two decimals. Additionally, we suggest that you use MAGIQ data primarily to monitor trends. A single MAGIQ analysis provides valuable information, but if you perform a series of MAGIQ analyses over time, you will get additional information about the relative quality of your systems under test.

You should use caution when interpreting the magnitude of the MAGIQ quality metrics. In this example, System X has an overall quality metric of 0.3479 and System A is at 0.2449. It is mathematically correct to say that System A is (0.3479 - 0.2449) / 0.3479 = 30 percent worse than System X. However, the magnitudes of the quality metrics depend in part upon the number of systems being compared. Suppose, for example, we were comparing only four systems instead of five. We would likely get the same order of results, but the magnitudes would be different, and therefore the relative difference between the quality metrics would differ too. If you adopt MAGIQ as one of your competitive analysis techniques, you'll quickly gain an intuitive sense of what the magnitudes of your metrics mean.
As a rule of thumb, when comparing four or five software systems, we have found that a difference of 0.10 between two systems is significant (in the normal use of the word rather than in the statistical sense).

There is another argument for using rank order centroids. Imagine that you have just two items, A and B, which are ranked first and second, respectively. Because you know A is better than B, A's rating must be somewhere between 0.5 and 1.0, and B's rating must be somewhere between 0.0 and 0.5. If you take the midpoint of each of these two intervals, you get 0.7500 and 0.2500 as with ROCs. With just two items, you are essentially working on a one-dimensional line. ROCs generalize this concept across an N-dimensional space.

One detail to address is what to do in the case of tied rankings for comparison attributes or systems under comparison. In this situation you compute ROCs for the tied attributes or systems, arbitrarily ranking one higher than the other, then take the average of the resulting rating values. A quick example will show you what we mean. Suppose you are comparing four systems, A, B, C, and D, on some attribute (say authentication security). You determine that System A is best, Systems B and C are tied, and System D is fourth. Without the tie, the rank order centroid weights are w1 = 0.5208, w2 = 0.2708, w3 = 0.1458, and w4 = 0.0625. You take the values for the tied systems (w2 and w3), compute their average, (0.2708 + 0.1458) / 2 = 0.2083, and use that average for each of the tied systems.

MAGIQ versus AHP

The MAGIQ technique is a close cousin to a multiattribute technique called the analytic hierarchy process (AHP). In Test Run in the June 2005 issue of MSDN® Magazine I describe the use of AHP for build quality (see msdn.microsoft.com/msdnmag/issues/05/06/TestRun). We originally developed MAGIQ as a way of validating build quality metrics obtained using AHP.
The analytic hierarchy process decomposes a problem very much like MAGIQ, but AHP uses a pairwise comparison method instead of rank order centroids. Pairwise comparisons produce more accurate attribute weights than ROCs, but they take a lot of time and effort. For example, comparing 7 systems pairwise takes 7 x 6 / 2 = 21 comparisons per attribute, so with 10 comparison attributes you would have 21 x 10 = 210 comparisons to perform. We soon discovered that our MAGIQ results correlated almost perfectly with our AHP results, so we started using the quicker MAGIQ technique on a daily basis, and used AHP just once a week to validate our MAGIQ metrics.

Another good feature of the MAGIQ technique is that it is very easy to understand. With just a few minutes of explanation (or this column), everyone on your team can learn how to perform a MAGIQ analysis. This allows you to either distribute responsibility for competitive analysis or to have multiple evaluators. Allowing several people on your team to perform competitive analysis using MAGIQ can have a large morale-building effect since team members will be making a significant contribution to the overall system under development.

Although we've used the MAGIQ technique on several significant software products, it has not been subjected to serious academic investigation or research. That said, we think you'll find that MAGIQ can be an extremely useful complement to other, traditional software quality techniques such as bug count metrics. Based on our experience you should not rely completely on any single software system quality metric or technique. As the software development environment continues to mature, techniques like MAGIQ will become increasingly important components of your software engineering skill set.

Send your questions and comments to testrun@microsoft.com.

Dr. James McCaffrey works for Volt Information Sciences Inc., where he manages technical training for software engineers working at Microsoft. He has worked on several Microsoft products including Internet Explorer and MSN Search. James can be reached at jmccaffrey@volt.com or vjammc@microsoft.com.
Nasa Koski is a Software Test Lead at Microsoft. Currently, she leads a team of system engineers at MSN who provide systems expertise and design in an extensive lab environment. Nasa can be reached at nasak@microsoft.com.

