Performance Testing Guidance for Web Applications
J.D. Meier, Carlos Farre, Prashant Bansode, Scott Barber, and Dennis Rea
Microsoft Corporation
September 2007
Objectives
- Understand common principles and considerations of
performance test execution.
- Understand the common activities of performance test
execution.
Overview
Performance test execution is the activity that occurs between
developing test scripts and reporting and analyzing test results. Much of the training
available today on performance testing treats this activity as little more
than starting a test and monitoring it to ensure that the test appears to be
running as expected. In reality, this activity is significantly more complex
than just clicking a button and monitoring machines. This chapter addresses
these complexities based on numerous real-world project experiences.
How to Use this Chapter
Use this chapter to understand
the key principles and considerations underlying performance test execution and
the various activities that it entails. To get the most from
this chapter:
- Use the “Approach for Test Execution”
section to get an overview of the approach for performance
test execution and as a quick reference guide for you and your team.
- Use the various activity sections to understand the
details of each activity involved in performance test execution.
Approach for Test Execution
The following activities are involved in performance test
execution:
- Validate the test environment
- Validate tests
- Run tests
- Baseline and benchmark
- Archive tests
The following sections discuss each of these activities in
detail.
Validate the Test Environment
The goal is for the test environment to mirror your
production environment as closely as possible. Typically, any differences
between the test and production environments are noted and accounted for while
designing tests. Before running your tests, it is important to validate that
the test environment matches the configuration that you were expecting and/or
designed your test for. If the test environment is even slightly different from
the environment you designed your tests to be run against, there is a high probability
that your tests might not work at all, or worse, that they will work but will
provide misleading data.
The following activities frequently prove valuable when
validating a test environment:
- Ensure that the test environment is correctly configured
for metrics collection.
- Turn off any active virus-scanning on load-generating
machines during testing, to minimize the likelihood of unintentionally
skewing results data as a side-effect of resource consumption by the
antivirus/anti-spyware software.
- Consider simulating background activity, when necessary.
For example, many servers run batch processing during predetermined time periods,
while servicing users’ requests. Not accounting for such activities in
those periods may result in overly optimistic performance results.
- Run simple usage scenarios to validate the Web server
layer first if possible, separately from other layers. Run your scripts
without think times. Try to run a scenario that does not include database
activity. Inability to utilize 100 percent of the Web server’s processor
can indicate a network problem or that the load generator clients have
reached their maximum output capacity.
- Run simple usage scenarios that are limited to reading
data to validate database scenarios. Run your script without think times.
Use test data feeds to simulate randomness. For example, query for a set
of products. Inability to utilize 100 percent of the Web server’s
processor can indicate a network problem or that the load-generator
clients have reached their maximum output capacity.
- Validate the test environment by running more complex usage
scenarios with updates and writes to the database, using a mix of test
scripts that simulate business actions.
- In Web farm environments, check to see if your load tests
are implementing Internet Protocol (IP) switching. Not doing so may cause
IP affinity, a situation where all of the requests from the load-generation
client machine are routed to the same server rather than being balanced
across all of the servers in the farm. IP affinity leads to inaccurate load
test results because other servers participating in the load balancing
will not be utilized.
- Work with key performance indicators (KPIs) on all the
servers to assess your test environment (processor, network, disk, and
memory). Include all servers in the cluster to ensure correct evaluation
of your environment.
- Consider spending time creating data feeds for the test
application. For example, populate database tables with data at
production-like volumes (number of users, products, and orders shipped) so
that you can create similar conditions to replicate problems in critical
usage scenarios. Many scenarios involve running queries against tables
containing several thousand entries, to simulate lock timeouts or deadlocks.
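One way to spot the IP affinity problem described above is to tally which
server handled each request during a validation run and check how evenly the
load was spread. The following is a minimal sketch in Python, assuming you can
extract a server name per request from your logs. The function name, the server
names, and the 0.5 tolerance are illustrative assumptions, not part of any
load-generation tool.

```python
from collections import Counter


def find_underused_servers(handled_by, farm, tolerance=0.5):
    """Return farm servers that handled less than `tolerance` times
    their fair share of requests (the threshold is a hypothetical choice)."""
    counts = Counter(handled_by)
    fair_share = len(handled_by) / len(farm)
    return sorted(s for s in farm if counts[s] < tolerance * fair_share)


# Hypothetical sample data: one server name per request seen in the logs.
farm = ["web01", "web02", "web03"]
skewed = ["web01"] * 300                       # every request hit web01
balanced = ["web01", "web02", "web03"] * 100   # evenly spread

print(find_underused_servers(skewed, farm))    # servers that were starved
print(find_underused_servers(balanced, farm))  # empty list when balanced
```

A non-empty result does not prove IP affinity on its own, but it is a cheap
signal that the load balancer is not distributing the simulated users as the
test design intended.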
Additional Considerations
Consider the following key points when troubleshooting
performance-testing environments:
- Look for problems in the load-generation clients from
which load is simulated. Client machines often produce inaccurate
performance-testing results due to insufficient processor or memory
resources. Consider adding more client computers to compensate for fast
transactions that may cause higher processor utilization; also consider using
more memory when this becomes the bottleneck. Memory can be consumed when
test data feeds are cached in load generators, or by more complex
scripting in load tests.
- Some network interface cards (NICs), when set to auto mode,
will fail to negotiate full-duplex mode with switches. The
result is that the NICs operate in half-duplex mode, which
causes inaccurate performance-testing results. A typical perimeter network
with a Web server and database server in different layers will be deployed
with the Web server having two NICs, one facing your clients and another using
a different route to communicate with the database layer. Be
aware that having one NIC in the Web server facing both the clients and the
database tier may cause a network bottleneck.
- The database server in the production environment may be
using separate hard drives for log files and data files associated with
the database as a matter of policy. Replicate such deployment
configurations to avoid inaccurate performance-testing results. Consider
that if DNS is not properly configured, it might cause broadcast messages
to be sent when opening database connections by using the database server
name. Name-resolution issues may cause connections to open slowly.
- Improper data feeds consumed by your scripts will frequently
cause you to overlook problems with the environment. For example, low processor
activity may be caused by artificial locking due to scripts querying the
same record from the database. Consider creating test data feeds that
simulate the correct business actions, accounting for variability of data
sent in POST requests. Load-generation tools may use a central
repository such as a database or files in a directory structure to collect
performance test data. Make sure that the data repository is located on a machine
that will not cause traffic in the route used by your load-generation
tools; for example, put the data repository in the same virtual
local-area network (VLAN) as the machine used to manage data collection.
- Load-generation tools may require the use of special
accounts between load-generator machines and the computers that collect
performance data. Make sure that you set such configurations correctly. Verify
that data collection is occurring in the test environment, taking into
consideration that the traffic may be required to pass through a firewall.
Validate Tests
Poor load simulations can render all previous work useless.
To understand the data collected from a test run, the load simulation must accurately
reflect the test design. When the simulation does not reflect the test design,
the results are prone to misinterpretation. Even if your tests accurately reflect
the test design, there are still many ways that the test can yield invalid or
misleading results. Although it may be tempting to simply trust your tests, it
is almost always worth the time and effort to validate the accuracy of your
tests before you need to depend on them to provide results intended to assist
in making the “go-live” decision. It may be useful to think about test
validation in terms of the following four categories:
- Test design implementation. To validate that you
have implemented your test design accurately (using whatever method you
have chosen), you will need to run the test and examine exactly what the
test does.
- Concurrency. After you have validated that your
test conforms to the test design when run with a single user, run the test
with several users. Ensure that each user is seeded with unique data, and
that users begin their activity within a few seconds of one another — not
all at the same second, as this is likely to create an unrealistically
stressful situation that would add complexity to validating the accuracy
of your test design implementation. One method of validating that tests
run as expected with multiple users is to use three test runs: one with 3
users, one with 5 users, and one with 11 users. These three tests have a
tendency to expose many common issues with both the configuration of the
test environment (such as a limited license being installed on an
application component) and the test itself (such as parameterized data not
varying as intended).
- Combinations of tests. Having validated that a
test runs as intended with a single user and with multiple users, the next
logical step is to validate that the test runs accurately in combination
with other tests. Generally, when testing performance, tests get mixed and
matched to represent various combinations and distributions of users,
activities, and scenarios. If you do not validate that your tests have
been both designed and implemented to handle this degree of complexity
prior to running critical test projects, you can end up wasting a lot of
time debugging your tests or test scripts when you could have been
collecting valuable performance information.
- Test data validation. Once you are satisfied that
your tests are running properly, the last critical validation step is to
validate your test data. Performance testing can utilize and/or consume
large volumes of test data, thereby increasing the likelihood of errors in
your dataset. In addition to the data used by your tests, it is important
to validate that your tests share that data as intended, and that the
application under test is seeded with the correct data to enable your
tests.
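The concurrency checks above (unique data per user, starts spread over a few
seconds, runs with 3, 5, and 11 users) can be sketched as a small planning
helper. This is only an illustration in Python: the function name, the
five-second stagger window, and the account names are assumptions, and a real
load-generation tool would normally provide its own equivalent settings.

```python
import random


def plan_virtual_users(user_count, data_rows, max_stagger_s=5.0, seed=None):
    """Give each virtual user its own data row and a start delay
    spread over the first few seconds of the run."""
    if user_count > len(data_rows):
        raise ValueError("not enough unique data rows for every user")
    rng = random.Random(seed)
    return [
        {
            "user": i,
            "data": data_rows[i],  # one distinct row per user
            "start_delay_s": round(rng.uniform(0, max_stagger_s), 2),
        }
        for i in range(user_count)
    ]


# Hypothetical seed data: one account per virtual user.
rows = [f"account-{n:03d}" for n in range(11)]

for users in (3, 5, 11):
    plan = plan_virtual_users(users, rows, seed=1)
    unique = len({p["data"] for p in plan})
    print(f"{users} users, {unique} unique data rows")
```

Failing fast when there are fewer data rows than users is a deliberate choice
here: silently reusing rows is one way parameterized data ends up not varying
as intended.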
Dynamic Data
The following are technical reasons for using dynamic data correctly
in load test scripts:
- Using the same data value causes artificial usage of
caching because the system will retrieve data from copies in memory. This
can happen throughout different layers and components of the system,
including databases, file caches of the operating systems, hard drives,
storage controllers, and buffer managers. Reusing data from the cache
during performance testing might account for faster testing results than would
occur in the real world.
- Some business scenarios require a relatively small range
of data selection. In such cases, the cache is reused more frequently, but
the narrow range also reproduces other performance-related problems, such as
database deadlocks and slower response times due to timeouts caused by queries
to the same items. This type of scenario is typical of marketing campaigns
and seasonal sales events.
- Some business scenarios require using unique data during
load testing; for example, if the server returns session-specific
identifiers during a session after login to the site with a specific set
of credentials. Reusing the same login data would cause the server to
return a bad session identifier error. Another frequent scenario is when
the user enters a unique set of data, or the system fails to accept the
selection; for example, registering new users that would require entering
a unique user ID on the registration page.
- In some business scenarios, you need to control the number
of parameterized items; for example, a caching component that needs to be
tested for its memory footprint to evaluate server capacity, with a
varying number of products in the cache.
- In some business scenarios, you need to reduce the script
size or the number of scripts; for example, several instances of an
application will live in one server, reproducing a scenario where an
independent software vendor (ISV) will host them. In this scenario, the Uniform
Resource Locators (URLs) need to be parameterized during load test
execution for the same business scenarios.
- Using dynamic test data in a load test tends to reproduce
more complicated and time-sensitive bugs; for example, a deadlock
encountered as a result of performing different actions using different
user accounts.
- Using dynamic test data in a load test allows you to use error
values if they suit your test plan. For example, where an ID is
always a positive number, supplying zero or negative values can simulate
hacker behavior or replicate application errors, such as scanning the
database table when an invalid value is supplied.
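As an illustration of two of the cases above, the following Python sketch
produces values that never repeat (for scenarios such as new-user registration)
and lets you deliberately narrow the selection range (for scenarios such as a
sales event). The function names, the ID format, and the catalog contents are
hypothetical and would be replaced by whatever your load-generation tool uses
for data feeds.

```python
import itertools
import random


def registration_ids(prefix="loadtest"):
    """Yield user IDs that never repeat within a run."""
    for n in itertools.count(1):
        yield f"{prefix}-{n:06d}"


def product_picker(catalog, hot_items=None, seed=None):
    """Return a function that picks a product at random.

    Pass a short `hot_items` list to model a campaign in which most
    users request the same few products."""
    rng = random.Random(seed)
    pool = hot_items if hot_items else catalog
    return lambda: rng.choice(pool)


ids = registration_ids()
print([next(ids) for _ in range(3)])

catalog = list(range(1, 1001))          # hypothetical product IDs
wide = product_picker(catalog, seed=7)  # spreads requests over the catalog
narrow = product_picker(catalog, hot_items=[10, 11, 12], seed=7)
print(wide(), narrow())
```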
Test Validation
The following are some commonly employed methods of test
validation, which are frequently used in combination with one another:
- Run the test first with a single user only. This makes
initial validation much less complex.
- Observe your test while it is running and pay close
attention to any behavior you feel is unusual. Your instincts are usually
right, or at least valuable.
- Use the system manually during test execution so that you
can compare your observations with the results data at a later time.
- Make sure that the test results and collected metrics represent
what you intended them to represent.
- Check to see if any of the parent requests or dependent
requests failed.
- Check the content of the returned pages, as load-generation
tools sometimes report summary results that appear to “pass” even though
the correct page or data was not returned.
- Run a test that loops through all of your data to check
for unexpected errors.
- If appropriate, validate that you can reset test and/or
application data following a test run.
- At the conclusion of your test run, check the application
database to ensure that it has been updated (or not) according to your
test design. Consider that many transactions in which the Web server
returns a success status with a “200” code might be failing internally; for
example, errors due to a previously used user name in a new user
registration scenario, or an order number that is already in use.
- Consider cleaning the database entries between error
trials to eliminate data that might be causing test failures; for example,
order entries that you cannot reuse in subsequent test execution.
- Run tests in a variety of combinations and sequences to
ensure that one test does not corrupt data needed by another test in order
to run properly.
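Because a "200" status does not guarantee that the right page came back, it
can help to check page content explicitly. The sketch below shows one possible
check in Python. The function name and the marker strings are assumptions made
for illustration; it works on a status code and body that you have already
captured, rather than on any particular tool's API.

```python
def check_response(status, body, must_contain, must_not_contain=()):
    """Return a list of problems found in one response (empty if it looks valid)."""
    problems = []
    if status != 200:
        problems.append(f"unexpected status {status}")
    if must_contain not in body:
        problems.append(f"missing expected text: {must_contain!r}")
    for marker in must_not_contain:
        if marker in body:
            problems.append(f"found error text: {marker!r}")
    return problems


# Hypothetical registration response: status 200, but the request failed internally.
body = "<h1>Registration</h1><p>User name already in use</p>"
for problem in check_response(200, body, "Welcome", ["already in use"]):
    print(problem)
```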
Additional Considerations
Consider the following additional points when validating
your tests:
- Do not use performance results data from your validation
test runs as part of your final report.
- Report performance issues uncovered during your validation
test runs.
- Use appropriate load-generation tools to create a load that
has the characteristics specified in your test design.
- Ensure that the intended performance counters for
identified metrics and resource utilization are being measured and
recorded, and that they are not interfering with the accuracy of the
simulation.
- Run other tests during your performance test to ensure that
the simulation is not impacting other parts of the system. These other
tests may be either automated or manual.
- Repeat your test, adjusting variables such as user names
and think times to see if the test continues to behave as anticipated.
- Remember to simulate ramp-up and cool-down periods
appropriately.
Questions to Ask
- What additional team members should be involved in
evaluating the accuracy of this test?
- Do the preliminary results make sense?
- Is the test providing the data we expected?
Run Tests
Although the process and flow of running tests are extremely
dependent on your tools, environment, and project context, there are some
fairly universal tasks and considerations to keep in mind when running tests.
Once it has been determined that the application under test
is in an appropriate state to have performance tests run against it, the
testing generally begins with the highest-priority performance test that can
reasonably be completed based on the current state of the project and
application. After each test run, compile a brief summary of what happened
during the test and add these comments to the test log for future reference.
These comments may address machine failures, application exceptions and errors,
network problems, or exhausted disk space or logs. After completing the final
test run, ensure that you have saved all of the test results and performance
logs before you dismantle the test environment.
Whenever possible, limit tasks to one to two days each to
ensure that no time will be lost if the results from a particular test or
battery of tests turn out to be inconclusive, or if the initial test design
needs modification to produce the intended results. One of the most important tasks
when running tests is to remember to modify the tests, test designs, and
subsequent strategies as results analysis leads to new priorities.
A widely recommended guiding principle is: Run test tasks
in one- to two-day batches. See the tasks through to completion, but be willing
to take important detours along the way if an opportunity to add additional
value presents itself.
Keys to Efficiently and Effectively Running Tests
In general, the keys to efficiently and effectively running
tests include:
- Revisit performance-testing priorities after no more than
two days.
- Remember to capture and use a performance baseline.
- Plan to spend some time fixing application errors, or
debugging the test.
- Analyze results immediately so that you can modify your
test plan accordingly.
- Communicate test results frequently and openly across the
team.
- Record results and significant findings.
- Record other data needed to repeat the test later.
- At appropriate points during test execution, stress the
application to its maximum capacity or user load, as this can provide
extremely valuable information.
- Remember to validate application tuning or optimizations.
- Consider evaluating the effect of application failover and
recovery.
- Consider measuring the effects of different system
configurations.
Additional Considerations
Consider the following additional points when running your
tests:
- Performance testing is frequently conducted on an isolated
network segment to prevent disruption of other business operations. If
this is not the case for your test project, ensure that you obtain
permission to generate loads during certain hours on the available
network.
- Before running the real test, consider executing a quick “smoke
test” to make sure that the test script and remote performance counters
are working correctly.
- If you choose to execute a smoke test, do not report the
results as official or formal parts of your testing.
- Reset the system (unless your scenario is to do otherwise)
before running a formal test.
- If at all possible, execute every test twice. If the
results produced are not very similar, execute the test again. Try to
determine what factors account for the difference.
- No matter how far in advance a test is scheduled, give the
team 30-minute and 5-minute warnings before launching the test (or
starting the day’s testing). Inform the team whenever you are not going to
be executing for more than one hour in succession.
- Do not process data, write reports, or draw diagrams on
your load-generating machine while generating a load because this can
corrupt the data.
- Do not throw away the first iteration because of script
compilation or other reasons. Instead, measure this iteration separately
so you will know what the first user after a system-wide reboot can
expect.
- Test execution is never really finished, but eventually
you will reach a point of diminishing returns on a particular test. When
you stop obtaining valuable information, change your test.
- If neither you nor your development team can figure out
the cause of an issue in twice as much time as it took the test to
execute, it may be more efficient to eliminate one or more
variables/potential causes and try again.
- If your intent is to measure performance related to a
particular load, it is important to allow time for the system to stabilize
between increases in load to ensure the accuracy of measurements.
- Make sure that the client computers (also known as
load-generation client machines) that you use to generate load are not
overly stressed. Utilization of resources such as processor and memory
should remain low enough to ensure that the load-generation environment is
not itself a bottleneck.
- Analyze results immediately and modify your test plan
accordingly.
- Work closely with the team or team sub-set that is most
relevant to the test.
- Communicate test results frequently and openly across the
team.
- If you will be repeating the test, consider establishing a
test data restore point before you begin testing.
- In most cases, maintaining a test execution log that captures
notes and observations for each run is invaluable.
- Treat workload characterization as a moving target. Adjust
new settings for think times and number of users to model the new total
number of users for normal and peak loads.
- Observe your test during execution and pay close attention
to any behavior you feel is unusual. Your instincts are usually right, or
at least valuable.
- Ensure that performance counters relevant for identified
metrics and resource utilization are being measured and are not
interfering with the accuracy of the simulation.
- Use the system manually during test execution so that you
can compare your observations with the results data at a later time.
- Remember to simulate ramp-up and cool-down periods
appropriately.
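The advice above to execute every test twice and rerun when the results are
not very similar needs some working definition of "similar". One simple option
is sketched below in Python: compare mean response times and accept the pair
if they differ by no more than a chosen fraction. The 10 percent threshold and
the sample timings are assumptions; choose a threshold and a statistic that
suit your own test design.

```python
from statistics import mean


def runs_are_similar(run_a, run_b, max_relative_diff=0.10):
    """Compare mean response times (in seconds) of two runs.

    Assumes both runs contain positive timings."""
    a, b = mean(run_a), mean(run_b)
    return abs(a - b) / max(a, b) <= max_relative_diff


# Hypothetical response times, in seconds, from three executions of one test.
first = [0.82, 0.91, 0.88, 0.95]
second = [0.85, 0.90, 0.93, 0.92]
third = [1.40, 1.50, 1.60, 1.50]

print(runs_are_similar(first, second))  # True: means differ by about 1 percent
print(runs_are_similar(first, third))   # False: execute again and investigate
```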
Questions to Ask
- Have recent test results or project updates made this task
more or less valuable compared to other tests we could be conducting right
now?
- What additional team members should be involved with this
task?
- Do the preliminary results make sense?
Baseline and Benchmark
When baselines and benchmarks are used, they are generally
the first and last tests you will execute, respectively. Of all the tests that
may be executed during the course of a project, it is most important that
baselines and benchmarks be well understood and controlled, making the
validations discussed above even more important.
Baselines
Creating a baseline is the process of running a set
of tests to capture performance metric data for the purpose of evaluating the
effectiveness of subsequent performance-improving changes to the system or
application.
With respect to Web applications, you can use a baseline to
determine whether performance is improving or declining and to find deviations
across builds and versions. For example, you could measure load time, number of
transactions processed per unit of time, number of Web pages served per unit of
time, and resource utilization such as memory and processor usage. Some
considerations about using baselines include:
- A baseline can be created for a system, component, or
application.
- A baseline can be created at different layers: database,
Web services, etc.
- A baseline can be used as a standard for comparison to
track future optimizations or regressions. When using a baseline for this
purpose, it is important to validate that the baseline tests and results
are well understood and repeatable.
- Baselines can help product teams articulate variances that
represent degradation or optimization during the course of the development
life cycle by providing a known starting point for trend analysis.
Baselines are most valuable if created using a set of reusable test
assets; it is important that such tests are representative of workload
characteristics that are both repeatable and provide an appropriately
accurate simulation.
- Baseline results can be articulated by using combinations
of a broad set of key performance indicators such as response time,
processor, memory, disk, and network.
- Sharing baseline results across the team establishes a
common foundation of information about performance characteristics to
enable future communication about performance changes in an application or
component.
- A baseline is specific to an application and is most
useful for comparing performance across different builds, versions, or
releases.
- Establishing a baseline before making configuration
changes almost always saves time because it enables you to quickly
determine what effect the changes had on the application’s performance.
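A baseline comparison across builds can be as simple as computing the relative
change of each metric and labeling it. The following Python sketch assumes a
5 percent tolerance and hypothetical metric names. It also assumes you tell it
which metrics improve when they rise (such as pages served per second), since
for most others (response time, processor usage) a rise is a regression.

```python
def compare_to_baseline(baseline, current, higher_is_better=(), tolerance=0.05):
    """Label each baseline metric as regression, improvement, or unchanged."""
    report = {}
    for name, base in baseline.items():
        change = (current[name] - base) / base
        if name in higher_is_better:
            change = -change  # a drop in throughput counts as a regression
        if change > tolerance:
            report[name] = "regression"
        elif change < -tolerance:
            report[name] = "improvement"
        else:
            report[name] = "unchanged"
    return report


# Hypothetical metrics for a baseline build and a later build.
baseline = {"response_time_s": 1.00, "pages_per_s": 200.0, "cpu_percent": 60.0}
current = {"response_time_s": 1.20, "pages_per_s": 230.0, "cpu_percent": 61.0}

for metric, verdict in compare_to_baseline(
    baseline, current, higher_is_better=("pages_per_s",)
).items():
    print(metric, verdict)
```

A comparison like this is only meaningful if both sets of numbers come from
the same repeatable tests, which is why the baseline tests themselves need to
be well understood first.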
Benchmarking
Benchmarking is the process of comparing your system
performance against an industry standard that is endorsed by some other
organization.
From the perspective of Web application development,
benchmarking involves running a set of tests that comply with the
specifications of an industry benchmark to capture the performance metrics for
your application necessary to determine its benchmark score. You can then
compare your application against other systems or applications that have also
calculated their score for the same benchmark. You may choose to tune your
application performance to achieve or surpass a certain benchmark score. Some
considerations about benchmarking include:
- A benchmark score is achieved by working within industry
specifications or by porting an existing implementation to comply with
those specifications.
- Benchmarking generally requires identifying all of the
necessary components that will run together, the market where the product
exists, and the specific metrics to measure.
- Benchmark scores can be published publicly and may result in
comparisons being made
by competitors. Performance metrics that may be included along with
benchmark scores include response time, transactions processed per unit of
time, Web pages accessed per unit of time, processor usage, memory usage,
and search times.
Archive Tests
Some degree of change control or version control can be
extremely valuable for managing scripts, scenarios, and/or data changes between
each test execution, and for communicating these differences to the rest of the
team. Some teams prefer to check their test scripts, results, and reports into
the same version-control system as the build of the application to which they
apply. Other teams simply save copies into dated folders on a periodic basis,
or have their own version-control software dedicated to the performance team. It
is up to you and your team to decide what method is going to work best for you,
but in most cases archiving tests, test data, and test results saves much more
time than it takes over the course of a performance-testing project.
Additional Considerations
Consider the following additional points when archiving
tests:
- You can use archived test scripts, data, and results to
create the baseline for the next version of your product. Archiving this
information together with the build of the software that was tested
satisfies many auditability standards.
- In most cases, performance test scripts are improved or
modified with each new build. If you do not save a copy of the script and
identify the build it was used against, you can end up doing a lot of
extra work to get your scripts running again in the case of a build
rollback.
- With the overwhelming majority of load-generation tools,
implementing the test is a minor software-development effort in itself. While
this effort generally does not need to follow all of the team’s standards
and procedures for software development, it is a good idea to adopt a
sound and appropriately “weighted” development process for performance
scripts that complements or parallels the process your development team
employs.
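For teams that save copies into dated folders rather than using a
version-control system, the Python sketch below shows one way to keep scripts,
data, and results together with the build they were run against. The folder
naming scheme, the manifest file, and the build identifier are all assumptions
made for illustration.

```python
import json
import shutil
import tempfile
from datetime import date
from pathlib import Path


def archive_run(source_dir, archive_root, build_id, run_date=None):
    """Copy a test run into a dated folder and record the build it applies to."""
    run_date = run_date or date.today()
    target = Path(archive_root) / f"{run_date.isoformat()}_build-{build_id}"
    # copytree raises an error if the target exists, so an earlier
    # archive for the same date and build is never overwritten.
    shutil.copytree(source_dir, target)
    manifest = {
        "build": build_id,
        "date": run_date.isoformat(),
        "files": sorted(p.name for p in target.iterdir()),
    }
    (target / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return target


# Demonstration in a throwaway directory, with a hypothetical script file.
with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp) / "results"
    results.mkdir()
    (results / "checkout_script.txt").write_text("hypothetical test script")
    archived = archive_run(results, Path(tmp) / "archive", "1.0.42", date(2007, 9, 1))
    print(archived.name)
```

Recording the build identifier next to the scripts is what makes it practical
to get the right scripts running again after a build rollback.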
Summary
Performance test execution involves activities such as
validating test environments/scripts, running the test, and generating the test
results. It can also include creating baselines and/or benchmarks of the
performance characteristics.
It is important to validate the test environment to ensure
that the environment truly represents the production environment.
Validate test scripts to check if correct metrics are being
collected, and if the test script design is correctly simulating workload
characteristics.