Performance Testing Guidance for Web Applications
J.D. Meier, Carlos Farre, Prashant Bansode, Scott Barber, and Dennis Rea
Microsoft Corporation
September 2007
Objectives
- Learn the uses, meanings of, and concepts underlying
common mathematical and statistical principles as they apply to
performance test analysis and reporting.
Overview
Members of software development teams, developers, testers,
administrators, and managers alike need to know how to apply mathematics and
interpret statistical data in order to do their jobs effectively. Performance
analysis and reporting are particularly math-intensive. This chapter describes
the most commonly used, misapplied, and misunderstood mathematical and
statistical concepts in performance testing, in a way that will benefit any
member of the team.
Even though there is a need to understand many mathematical
and statistical concepts, many software developers, testers, and managers
either do not have strong backgrounds in or do not enjoy mathematics and
statistics. This leads to significant misrepresentations and misinterpretation
of performance-testing results. The information presented in this chapter is
not intended to replace formal training in these areas, but rather to provide
common language and commonsense explanations for mathematical and statistical
operations that are valuable to understanding performance testing.
How to Use This Chapter
Use this chapter to understand the different metrics and
calculations that are used for analyzing performance data results and preparing
performance results reports. To get the most from this
chapter:
- Use the “Exemplar Data Sets” section to gain an understanding
of the exemplars, which are used to illustrate the key mathematical
principles explained throughout the chapter.
- Use the remaining sections to learn about key mathematical
principles that will help you to understand and present meaningful
performance testing reports.
Exemplar Data Sets
This chapter refers to three exemplar data sets for the
purposes of illustration, namely:
- Data Set A
- Data Set B
- Data Set C
Data Sets Summary
The following is a summary of Data Sets A, B, and C.
Figure 15.1 Summary of Data Sets A, B, and C
Data Set A
Figure 15.2 Data Set A
100 total data points, distributed as follows:
- 5 data points have a value of 1.
- 10 data points have a value of 2.
- 20 data points have a value of 3.
- 30 data points have a value of 4.
- 20 data points have a value of 5.
- 10 data points have a value of 6.
- 5 data points have a value of 7.
Data Set B
Figure 15.3 Data Set B
100 total data points, distributed as follows:
- 80 data points have a value of 1.
- 20 data points have a value of 16.
Data Set C
Figure 15.4 Data Set C
100 total data points, distributed as follows:
- 11 data points have a value of 0.
- 10 data points have a value of 1.
- 11 data points have a value of 2.
- 13 data points have a value of 3.
- 11 data points have a value of 4.
- 11 data points have a value of 5.
- 11 data points have a value of 6.
- 12 data points have a value of 7.
- 10 data points have a value of 8.
Averages
An average ― also known as an arithmetic mean,
or mean for short ― is probably the most commonly used, and most
commonly misunderstood, statistic of all. To calculate an average, you simply
add up all the numbers and divide the sum by the quantity of numbers you just
added. What seems to confound many people the most when it comes to performance
testing is that, in this example, Data Sets A, B, and C each have an average of
exactly 4. In terms of application response times, these sets of data have
extremely different meanings. Given a response time goal of 5 seconds, looking
at only the average of these sets, all three seem to meet the goal. Looking at
the data, however, shows that none of the data sets is composed only of data
that meets the goal, and that Data Set B probably demonstrates some kind of
performance anomaly. Use caution when using averages to discuss response times
and, if at all possible, avoid using averages as the only reported statistic.
When reporting averages, it is a good idea to include the sample size, minimum
value, maximum value, and standard deviation for the data set.
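As a rough sketch of the reporting advice above, the following Python snippet rebuilds the three exemplar data sets from their value counts and prints the sample size, mean, minimum, maximum, and standard deviation for each. The helper names (`expand`, `summarize`) are illustrative, and the use of the population standard deviation (`pstdev`) is an assumption; the chapter does not say which form it uses.

```python
from statistics import mean, pstdev


def expand(counts):
    """Turn a {value: count} table into a flat list of measurements."""
    return [value for value, count in counts.items() for _ in range(count)]


# Exemplar data sets, built from the distributions listed in this chapter.
DATA_SETS = {
    "A": expand({1: 5, 2: 10, 3: 20, 4: 30, 5: 20, 6: 10, 7: 5}),
    "B": expand({1: 80, 16: 20}),
    "C": expand({0: 11, 1: 10, 2: 11, 3: 13, 4: 11, 5: 11, 6: 11, 7: 12, 8: 10}),
}


def summarize(data):
    """Collect the statistics suggested for reporting alongside an average."""
    return {
        "n": len(data),
        "mean": mean(data),
        "min": min(data),
        "max": max(data),
        # Assumption: population standard deviation, rounded for display.
        "stdev": round(pstdev(data), 2),
    }


for name, data in DATA_SETS.items():
    print(name, summarize(data))
```

Printing the minimum, maximum, and standard deviation next to the mean makes it visible that the three sets differ even though their averages match.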
Percentiles
Few people involved with developing software are familiar
with percentiles. A percentile is a straightforward concept that is easier
to demonstrate than define. For example, to find the 95th percentile
value for a data set consisting of 100 page-response-time measurements, you
would sort the measurements from largest to smallest and then count down six
data points from the largest. The 6th data point value represents
the 95th percentile of those measurements. For the purposes of
response times, this statistic is read “95 percent of the simulated users
experienced a response time of [the 6th-slowest value] or less for
this test scenario.”
It is important to note that percentile statistics can only
stand alone when used to represent data that is uniformly or normally
distributed with an acceptable number of outliers (see “Statistical Outliers”
below). To illustrate this point, consider the exemplar data sets. The 95th
percentile of Data Set B is 16 seconds. Obviously, this does not give the
impression of achieving the 5-second response time goal. Interestingly, this
can be misleading as well because the 80th percentile value of Data
Set B is 1 second. With a response time goal of 5 seconds, it is likely
unacceptable to have any response times of 16 seconds, so in this case neither
of these percentile values represents the data in a manner that is useful for
summarizing response time.
Data Set A is a normally distributed data set that has a 95th
percentile value of 6 seconds, an 85th percentile value of 5
seconds, and a maximum value of 7 seconds. In this case, reporting either the
85th or 95th percentile values represents the data in a
manner where the assumptions a stakeholder is likely to make about the data are
likely to be appropriate to the data.
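The counting procedure described in this section can be sketched in Python as follows. The function name `percentile_by_counting` is hypothetical, and the code assumes the simple "sort and count down" method from the text; statistics packages often interpolate between data points instead, so their results can differ slightly.

```python
def expand(counts):
    """Turn a {value: count} table into a flat list of measurements."""
    return [value for value, count in counts.items() for _ in range(count)]


DATA_SET_A = expand({1: 5, 2: 10, 3: 20, 4: 30, 5: 20, 6: 10, 7: 5})
DATA_SET_B = expand({1: 80, 16: 20})


def percentile_by_counting(data, pct):
    """Sort from largest to smallest, then count down past the top (100 - pct) percent."""
    ordered = sorted(data, reverse=True)
    # Number of measurements allowed to be larger than the reported value.
    above = int(len(ordered) * (100 - pct) // 100)
    # Clamp so that pct=0 returns the smallest value instead of failing.
    return ordered[min(above, len(ordered) - 1)]


for label, data in (("A", DATA_SET_A), ("B", DATA_SET_B)):
    for pct in (80, 85, 95):
        print(label, pct, percentile_by_counting(data, pct))
```

With 100 measurements and `pct=95`, `above` is 5, so the function returns the 6th-largest value, matching the description above.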
Medians
A median is simply the middle value in a data set
when sequenced from lowest to highest. In cases where there is an even number
of data points and the two center values are not the same, some disciplines
suggest that the median is the average of the two center data points, while
others suggest choosing the value closer to the average of the entire set of
data. In the case of the exemplar data sets, Data Sets A and C have median
values of 4, and Data Set B has a median value of 1.
Normal Values
A normal value (statisticians usually call this the mode) is the single
value that occurs most often in a data set. Data Set A has a normal value of 4,
Data Set B has a normal value of 1, and Data Set C has a normal value of 3.
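Medians and most-frequent values for the exemplar data sets can be checked with Python's standard `statistics` module, as in the minimal sketch below. Note that `statistics.median` averages the two center values when the number of data points is even, which is only one of the two conventions mentioned above. The `expand` helper is illustrative.

```python
from statistics import median, mode


def expand(counts):
    """Turn a {value: count} table into a flat list of measurements."""
    return [value for value, count in counts.items() for _ in range(count)]


DATA_SETS = {
    "A": expand({1: 5, 2: 10, 3: 20, 4: 30, 5: 20, 6: 10, 7: 5}),
    "B": expand({1: 80, 16: 20}),
    "C": expand({0: 11, 1: 10, 2: 11, 3: 13, 4: 11, 5: 11, 6: 11, 7: 12, 8: 10}),
}

for name, data in DATA_SETS.items():
    # median(): middle value (average of the two center values for even counts).
    # mode(): the single most frequently occurring value.
    print(name, "median:", median(data), "most frequent value:", mode(data))
```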
Standard Deviations
By definition, one standard deviation is the distance from
the mean within which approximately 68 percent of all measurements in a normally
distributed data set fall; in other words, knowing the
standard deviation of your data set tells you how densely the data points are
clustered around the mean. Simply put, the smaller the standard deviation, the
more consistent the data. To illustrate, the standard deviation of Data Set A
is approximately 1.5, the standard deviation of Data Set B is approximately 6.0,
and the standard deviation of Data Set C is approximately 2.6.
A common rule in this case is: “Data with a standard
deviation greater than half of its mean should be treated as suspect. If the
data is accurate, the phenomenon the data represents is not displaying a normal
distribution pattern.” Half of the mean is 2 for each exemplar data set. Applying
this rule, Data Set A (standard deviation of approximately 1.5) is likely to be a
reasonable example of a normal distribution, while Data Set C (approximately 2.6)
and, by a much wider margin, Data Set B (approximately 6.0) exceed the threshold
and should not be treated as reasonable representations of a normal distribution.
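The rule of thumb quoted above is easy to apply mechanically. The sketch below flags a data set when its standard deviation exceeds half of its mean; the function name `is_suspect` is hypothetical, and the population standard deviation is again an assumption.

```python
from statistics import mean, pstdev


def expand(counts):
    """Turn a {value: count} table into a flat list of measurements."""
    return [value for value, count in counts.items() for _ in range(count)]


DATA_SETS = {
    "A": expand({1: 5, 2: 10, 3: 20, 4: 30, 5: 20, 6: 10, 7: 5}),
    "B": expand({1: 80, 16: 20}),
    "C": expand({0: 11, 1: 10, 2: 11, 3: 13, 4: 11, 5: 11, 6: 11, 7: 12, 8: 10}),
}


def is_suspect(data):
    """Flag data whose standard deviation is greater than half of its mean."""
    return pstdev(data) > mean(data) / 2


for name, data in DATA_SETS.items():
    print(name, "stdev:", round(pstdev(data), 2),
          "half of mean:", mean(data) / 2,
          "suspect:", is_suspect(data))
```

A flag from this check is a prompt to look at the distribution of the data, not proof that the measurements are wrong.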
Uniform Distributions
Uniform distributions ― sometimes known as linear
distributions ― represent a collection of data that is roughly
equivalent to a set of random numbers evenly spaced between the upper and lower
bounds. In a uniform distribution, every number in the data set is represented
approximately the same number of times. Uniform distributions are frequently
used when modeling user delays, but are not common in response time results
data. In fact, uniformly distributed results in response time data may be an
indication of suspect results.
Figure 15.5 Uniform Distributions
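To illustrate the use of uniform distributions for modeling user delays, the following sketch draws simulated delays uniformly from a fixed range and counts how often each value appears. The `think_time` function, the 0 to 8 second range, and the sample size are all hypothetical choices made for this example.

```python
import random
from collections import Counter

rng = random.Random(7)  # fixed seed so the sketch is repeatable


def think_time(low=0, high=8):
    """Hypothetical user delay in whole seconds, drawn uniformly from low..high inclusive."""
    return rng.randint(low, high)


delays = [think_time() for _ in range(9000)]
counts = Counter(delays)

# In a uniform distribution each value appears roughly the same number of times.
for seconds in sorted(counts):
    print(seconds, counts[seconds])
```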
Normal Distributions
Also known as bell curves, normal distributions
are data sets whose member data are weighted toward the center (or median
value). When graphed, the shape of the “bell” of normally distributed data can
vary from tall and narrow to short and squat, depending on the standard
deviation of the data set. The smaller the standard deviation, the taller and
more narrow the “bell.” Statistically speaking, most measurements of human
variance result in data sets that are normally distributed. As it turns out,
end-user response times for Web applications are also frequently normally
distributed.
Figure 15.6 Normal Distribution
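The relationship between a normal distribution and its standard deviation can be checked on simulated data. The sketch below generates hypothetical, normally distributed values (a mean of 4 and a standard deviation of 1.5 are assumed purely for illustration) and reports the share of values within one, two, and three standard deviations of the mean.

```python
import random
from statistics import mean, pstdev

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical simulated measurements; not real response-time data.
samples = [rng.gauss(4.0, 1.5) for _ in range(10_000)]


def fraction_within(data, sigmas):
    """Share of values that lie within `sigmas` standard deviations of the mean."""
    mu, sd = mean(data), pstdev(data)
    return sum(abs(x - mu) <= sigmas * sd for x in data) / len(data)


for k in (1, 2, 3):
    print(k, "standard deviation(s):", round(fraction_within(samples, k), 3))
```

For normally distributed data the first figure should come out close to 68 percent; a markedly different share in real results is one hint that the data is not normally distributed.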
Statistical Significance
Mathematically calculating statistical significance, or reliability,
based on sample size is a task that is too arduous and complex for most
commercially driven software-development projects. Fortunately, there is a
commonsense approach that is both efficient and accurate enough to identify the
most significant concerns related to statistical significance. Unless you have
a good reason to use a mathematically rigorous calculation for statistical
significance, a commonsense approximation is generally sufficient. In support
of the commonsense approach described below, consider this excerpt from a
StatSoft, Inc. (http://www.statsoftinc.com) discussion on the topic:
There is no way to avoid arbitrariness in the final decision as to what level of
significance will be treated as really ‘significant.’ That is, the selection of
some level of significance, up to which the results will be rejected as
invalid, is arbitrary.
Typically, it is fairly easy to add iterations to performance
tests to increase the total number of measurements collected; the best way to
ensure statistical significance is simply to collect additional data if there
is any doubt about whether or not the collected data represents reality. Whenever
possible, ensure that you obtain a sample size of at least 100 measurements
from at least two independent tests.
Although there is no strict rule about how to decide which
results are statistically similar without complex equations that call for huge
volumes of data that commercially driven software projects rarely have the time
or resources to collect, the following is a reasonable approach to apply if
there is doubt about the significance or reliability of data after evaluating
two test executions where the data was expected to be similar. Compare results
from at least five test executions and apply the rules of thumb below to
determine whether or not test results are similar enough to be considered
reliable:
- If more than 20 percent (or one out of five) of the test-execution results appear
not to be similar to the others, something is generally wrong with the test
environment, the application, or the test itself.
- If a 90th percentile value for any test execution is greater than the
maximum or less than the minimum value for any of the other test executions,
that data set is probably not statistically similar.
- If measurements from a test are noticeably higher or lower, when charted
side-by-side, than the results of the other test executions, it is probably not
statistically similar.
Figure 15.7 Result Comparison
- If one data set for a particular item (e.g., the response time for a single page) in a
test is noticeably higher or lower, but the results for the data sets of the
remaining items appear similar, the test itself is probably statistically
similar (even though it is probably worth the time to investigate the reasons
for the difference of the one dissimilar data set).
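Two of the rules of thumb above, the 90th percentile comparison and the 20 percent threshold, can be sketched in code. Everything in this example is hypothetical: the function names, the five sets of response times, and the reading of "more than 20 percent" as strictly greater than one run in five.

```python
def percentile_by_counting(data, pct):
    """Sort from largest to smallest, then count down past the top (100 - pct) percent."""
    ordered = sorted(data, reverse=True)
    above = int(len(ordered) * (100 - pct) // 100)
    return ordered[min(above, len(ordered) - 1)]


def dissimilar_runs(runs, pct=90):
    """Names of runs whose pct-th percentile lies outside the min..max range of any other run."""
    flagged = []
    for name, data in runs.items():
        p = percentile_by_counting(data, pct)
        others = [d for other, d in runs.items() if other != name]
        if any(p > max(d) or p < min(d) for d in others):
            flagged.append(name)
    return flagged


def environment_suspect(runs, flagged, threshold=0.20):
    """True when more than 20 percent of the runs look dissimilar."""
    return len(flagged) / len(runs) > threshold


# Hypothetical response times (seconds) from five executions of the same test.
RUNS = {
    "run1": [2.1, 2.4, 2.6, 2.8, 3.0, 3.1, 3.3, 3.5, 3.9, 4.2],
    "run2": [2.0, 2.3, 2.5, 2.9, 3.0, 3.2, 3.4, 3.6, 3.8, 4.1],
    "run3": [2.2, 2.4, 2.7, 2.8, 3.1, 3.2, 3.3, 3.6, 3.9, 4.0],
    "run4": [2.1, 2.5, 2.6, 2.9, 3.0, 3.3, 3.4, 3.5, 3.8, 4.2],
    "run5": [2.3, 2.9, 3.4, 4.0, 4.8, 5.5, 6.3, 7.1, 8.5, 9.0],
}

flagged = dissimilar_runs(RUNS)
print("probably not statistically similar:", flagged)
print("suspect environment or test:", environment_suspect(RUNS, flagged))
```

A check like this supplements, rather than replaces, charting the results side by side as described above.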
Statistical Equivalence
The method above for determining statistical significance
actually applies the principle of statistical equivalence. Essentially, the
process outlined above for determining statistical significance could be
restated as “Given results data from multiple tests intended to be equivalent,
the data from any one of those tests may be treated as statistically
significant if that data is statistically equivalent to 80 percent or more of
all the tests intended to be equivalent.” Mathematical determination of
equivalence using such formal methods as chi-squared and t-tests is not common
on commercial software development projects. Rather, it is generally deemed
acceptable to estimate equivalence by using charts similar to those used to
determine statistical significance.
Statistical Outliers
From a purely statistical point of view, any measurement
that falls more than three standard deviations from the mean, a range that covers
roughly 99 percent of normally distributed measurements, is considered an outlier. The problem with this definition is
that it assumes that the collected measurements are both statistically
significant and distributed normally, which is not at all automatic when
evaluating performance test data.
For the purposes of this explanation, a more applicable
definition of an outlier, from StatSoft, Inc. (http://www.statsoftinc.com), is
the following:
Outliers are atypical, infrequent
observations: data points which do not appear to follow the distribution of the
rest of the sample. These may represent consistent but rare traits, or be the
result of measurement errors or other anomalies which should not be modeled.
Note that this (or any other) description of outliers only
applies to data that is deemed to be a statistically significant sample of
measurements. Without a statistically significant sample, there is no generally
acceptable approach to determining the difference between an outlier and a
representative measurement.
Using this description, results graphs can be used to
determine evidence of outliers — occasional data points that just don’t seem to
belong. A reasonable approach to determining if any apparent outliers are truly
atypical and infrequent is to re-execute the tests and then compare the results
to the first set. If the majority of the measurements are the same, except for
the potential outliers, the results are likely to contain genuine outliers that
can be disregarded. However, if the results show similar potential outliers,
these are probably valid measurements that deserve consideration.
After identifying that a dataset appears to contain
outliers, the next question is, how many outliers can be dismissed as “atypical
infrequent observations?”
There is no set number of outliers that can be unilaterally
dismissed, but rather a maximum percentage of the total number of observations.
Applying the spirit of the two definitions above, a reasonable conclusion would
be that up to 1 percent of the total values for a particular measurement that
are outside of three standard deviations from the mean are significantly atypical
and infrequent enough to be considered outliers.
In summary, in practice for commercially driven software
development, it is generally acceptable to say that values representing less
than 1 percent of all the measurements for a particular item that are at least three
standard deviations off the mean are candidates for omission in results
analysis if (and only if) identical values are not found in previous or
subsequent tests. To express the same concept in a more colloquial way:
obviously rare and strange data points that can’t immediately be explained,
account for a very small part of the results, and are not identical to any
results from other tests are probably outliers.
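The summary rule above can be expressed as a small filter. This is a sketch under stated assumptions: the function name `outlier_candidates` is hypothetical, "up to 1 percent" is treated as inclusive, the population standard deviation is used, and "identical values" are compared by exact equality, which is simplistic for real timing data.

```python
from statistics import mean, pstdev


def outlier_candidates(data, other_runs=(), sigmas=3, max_fraction=0.01):
    """Return values that are candidates for omission under the rule of thumb."""
    mu, sd = mean(data), pstdev(data)
    far = [x for x in data if abs(x - mu) >= sigmas * sd]
    # Too many distant values to call them infrequent: keep them all.
    if len(far) > max_fraction * len(data):
        return []
    # Values that also appear in other test runs are probably valid measurements.
    seen_elsewhere = {x for run in other_runs for x in run}
    return [x for x in far if x not in seen_elsewhere]


# Hypothetical measurements: 199 ordinary response times and one very slow one.
run = [2.0] * 100 + [2.5] * 99 + [30.0]

print(outlier_candidates(run))
print(outlier_candidates(run, other_runs=[[2.0, 2.5, 30.0]]))
```

The first call reports the 30-second value as a candidate for omission; the second does not, because the same value appears in another run.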
A note of caution: identifying a data point as an outlier
and excluding it from results summaries does not imply ignoring the data point.
Excluded outliers should be tracked in some manner appropriate to the project
context in order to determine, as more tests are conducted, if a pattern of
concern is identified in what by all indications are outliers for individual
tests.
Confidence Intervals
Because determining levels of confidence in data is even
more complex and time-consuming than determining statistical significance or
the existence of outliers, it is extremely rare to make such a determination
during commercial software projects. A confidence interval for a specific
statistic is the range of values around the statistic where the ‘true’
statistic is likely to be located within a given level of certainty.
Because stakeholders do frequently ask for some indication
of the presumed accuracy of test results ― for example, what is the
confidence interval for these results? ― another commonsense approach
must be employed.
When performance testing, the answer to that question is directly
related to the accuracy of the model tested. Since in many cases the accuracy
of the model cannot be reasonably determined until after the software is
released into production, this is not a particularly useful dependency. However,
there is a way to demonstrate a confidence interval in the results.
By testing a variety of scenarios, including what the team
determines to be “best,” “worst,” and “expected” cases in terms of the
measurements being collected, a graphical depiction of a confidence interval
can be created, similar to the one below.
Figure 15.8 Usage Models
In this graph, a dashed line represents the performance
goal, and the three curves represent the results from the worst-case (most
performance-intensive), best-case (least performance-intensive), and
expected-case user community models. As one would expect, the blue curve from
the expected case falls between the best- and worst-case curves. Observing
where these curves cross the red performance-goal line, one can see how many users can access
the system in each case while still meeting the stated performance goal. If the
team is 95-percent confident (by their own estimation) that the best- and
worst-case user community models are truly best- and worst-case, this chart can
be read as follows: the tests show, with 95-percent confidence, that between
100 and 200 users can access the system while experiencing acceptable
performance.
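The reading of such a chart can be approximated in code. In the sketch below, the response-time curves, the 5-second goal, and the function name are all hypothetical values invented for illustration; they are not taken from the figure. For each usage model, the code finds the largest tested user load that still meets the goal.

```python
GOAL_SECONDS = 5.0  # hypothetical performance goal

# Hypothetical (users, response time in seconds) results for each usage model.
CURVES = {
    "best": [(50, 1.0), (100, 1.8), (150, 3.0), (200, 5.0), (250, 8.5)],
    "expected": [(50, 1.2), (100, 2.5), (150, 5.0), (200, 9.0), (250, 15.0)],
    "worst": [(50, 2.0), (100, 5.0), (150, 9.5), (200, 16.0), (250, 25.0)],
}


def max_users_meeting_goal(curve, goal):
    """Largest tested user load whose response time is at or under the goal."""
    passing = [users for users, seconds in curve if seconds <= goal]
    return max(passing) if passing else 0


limits = {name: max_users_meeting_goal(curve, GOAL_SECONDS)
          for name, curve in CURVES.items()}
print(limits)
print("supported users range from", limits["worst"], "to", limits["best"])
```

The range between the worst-case and best-case limits is only as trustworthy as the team's confidence that those models really bracket production usage.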
Although a confidence interval of between 100 and 200 users
might seem quite large, it is important to note that without empirical data
representing the actual production usage, it is unreasonable to expect higher
confidence in results than there is in the models that generate those results.
The best that one can do is to be 100-percent confident that the test results
accurately represent the model being tested.
Summary
Members of software development teams, developers, testers,
administrators, and managers alike need to know how to apply mathematics and
interpret statistical data in order to do their jobs effectively. Performance
analysis and reporting are particularly math-intensive. It is critical that
mathematical and statistical concepts in performance testing be understood so that
correct performance-testing analysis and reporting can be done.