Workflow Scalability and Performance in Windows SharePoint Services 3.0
Summary: Learn how to test the performance of workflow applications and determine how to apply the results. (18 printed pages)
David Mann, Mann Software, LLC
Rohit Puri, Microsoft Corporation
Applies to: Windows SharePoint Services 3.0
Workflow is arguably one of the most exciting additions to SharePoint Products and Technologies. Like anything new, workflow is surrounded by questions. What exactly can it do? How do we use the functionality it provides to solve our business problems? Most importantly for this article, how does it perform under load and can it handle my enterprise-grade requirements?
The questions of scalability and performance are the focus of this article. We start with an overview of workflow in Windows SharePoint Services and the types of situations in which you might use it. Then we consider performance testing of workflow applications and an analysis of what those results might mean to your organization.
SharePoint workflow is designed to support and enhance human-based processes. It excels at managing the interactions among people, tasks, and content. In a very generic sense, most SharePoint workflows follow a process along these lines:
1. A person or process initiates some action that causes the workflow to begin. The action could be simply creating or editing content or manually initiating the workflow.
2. The workflow determines who needs to take the next action on the content, and assigns that person a task.
3. When that task is complete or some other condition indicates that the process should continue, the workflow creates a new task, escalates the current task, or performs some similar action.
4. Step 3 repeats until the workflow is complete.
Yes, naturally, there are exceptions—some quite notable. But generically, the vast majority of SharePoint workflows are simply variations on this theme.
So what type of content are we talking about? Almost any type of content that can be used within a SharePoint site: documents, list items (such as announcements, calendar entries, and links), Web content—anything that can be stored in Windows SharePoint Services. In addition, with just a little customization, a SharePoint workflow can also manage content that is stored outside the Windows SharePoint Services database.
Types of Business Problems Solved
Workflow is a business tool that is uniquely suited to solving certain problems. Not all problems can or should be solved with workflow, and you can certainly get into significant trouble trying to force-fit a workflow solution to a business problem.
So what types of business problems make sense for workflow? Any of the following are certainly appropriate:
Routing (single step or multistep)
Generically, any process that involves people interacting with content, in either a loosely defined or tightly defined way, is a good candidate for a workflow. Very simple processes, such as notifications, can be handled with a workflow but are likely better candidates for built-in alerts or a simple event receiver. Processes that require no interaction between people and content can also be handled with a workflow but might be handled better with an event receiver. In specific cases, workflows that are created in Microsoft Office SharePoint Designer could be substituted for simple event receivers successfully. This approach has some potential problems, however. See the next section for more information.
Workflows vs. Event Receivers
Workflows and event receivers have many similarities, including the following:
Each can be triggered when content is added or changed.
Each can run a series of steps to complete some functional process.
Implementing each, however, is vastly different. An event receiver requires Microsoft Visual Studio and a developer; workflows can be implemented either by a developer who is using Visual Studio or by a business user working with Office SharePoint Designer.
If a developer is involved, both options are available. The developer can choose whether to create an event receiver or a workflow. The choice largely comes down to the answers to the following questions:
Does any user interaction occur?
Will the process run for a long time (more than a second or two)?
Will the process need to pause to wait for another process to complete a task?
Will the process be run many times (more than 25 or 30) concurrently?
If the answer is "Yes" for any of these questions, you should build a workflow; if "No" for all, consider an event receiver. There are certainly some exceptions, but in general these rules apply.
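The rules of thumb above can be condensed into a small helper. This is an illustrative sketch in Python, not part of any SharePoint API; the function and parameter names are assumptions chosen to mirror the four questions.

```python
def choose_mechanism(user_interaction: bool,
                     long_running: bool,
                     must_pause: bool,
                     high_concurrency: bool) -> str:
    """Apply the rules of thumb: any 'yes' answer points to a workflow;
    all 'no' answers suggest an event receiver."""
    if any([user_interaction, long_running, must_pause, high_concurrency]):
        return "workflow"
    return "event receiver"

# A fast, fire-and-forget notification with no human involvement:
print(choose_mechanism(False, False, False, False))  # event receiver
# A multi-day approval process that waits on people:
print(choose_mechanism(True, True, True, False))     # workflow
```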
As with any process, running a large number of event receivers has performance implications. However, that discussion is beyond the scope of this article.
One benefit of workflows is the ability to control the impact of processing. We discuss how to achieve this with workflows shortly; event receivers offer no equivalent capability.
Creating event receivers requires the involvement of a developer. Nondevelopers (business users or administrators) can build only workflows. Furthermore, they can build workflows only by using SharePoint Designer (or another declarative workflow editor). In many cases, however, SharePoint Designer workflows can mimic simple event receivers.
Some potential issues with using SharePoint Designer workflows (related to deployment and manageability) are beyond the scope of this article. However, if used appropriately (for smaller scale, single-site/single-list workflows), SharePoint Designer can work. SharePoint Designer is an excellent tool. As with any tool, use it properly and you'll be fine; use it improperly and you'll likely have problems.
In summary, workflows and event receivers are each viable options for process automation in Windows SharePoint Services. There are situations where one is a better option than the other. Workflows are great for processes that involve system state; event receivers are great for processes that don't involve state. Workflows have more overhead for getting started, but they offer greater management control and the ability to regulate processing. Event receivers are much lighter weight and can be easier to code but require more administrative and developer experience. Furthermore, event receivers do not perform well when they must wait for longer term operations to complete.
The important point is that you must make a conscious decision based on your business needs—don't make a choice arbitrarily.
Windows Workflow Foundation, the foundation upon which SharePoint workflows are built, supports two types of built-in workflows: sequential and state machine. Although the distinction is important for architecting and developing your workflow, the paradigm you choose is not relevant to performance. Any performance distinction between a state machine and a sequential workflow is either nonexistent or negligible. In either case, the workflow is compiled into an assembly and processed according to its design.
Workflow Development Tools
As already mentioned, you can build SharePoint workflows with one of two tools:
Visual Studio
SharePoint Designer (or another declarative workflow builder tool)
Each tool has strengths and weaknesses, which are beyond the scope of this article. Like the workflow paradigm discussed in the previous section, the tool with which you build the workflow is not relevant to the success of your workflow. Declarative workflows are not compiled initially, but after they are compiled for the first run, SharePoint Designer workflows are no different than Visual Studio sequential workflows. Any performance distinction between a workflow created in Visual Studio and a workflow created in SharePoint Designer is limited to a minor difference on the first run and is thereafter nonexistent.
Performance discussions are a statistician's dream. Ask a dozen people what performance means, and you will likely get thirteen or fourteen different answers. Give a set of numbers to any moderately capable high-school statistics student, and he or she can slice and dice them to prove nearly any theorem you posit.
Workflow performance is no different. It means different things to different people. Before we go much further, then, we should define what we mean by performance.
I contend that performance is certainly not simply transactions per second. Although running a process quickly and efficiently is important, a few extra seconds here and there are irrelevant because of one important factor: people. Although SharePoint workflow is built on the Windows Workflow Foundation, which can support nonhuman (machine-to-machine) processes, SharePoint workflow also focuses on people and their content.
And as any 13-year-old science-fiction fan can tell you, wetware (a person) is slow.
For our purposes, this fact means that our workflow engine is managing what it considers to be long-running processes. Typical human workflows execute over the course of days, weeks, or months—not nanoseconds. Even the fastest workflow process in which humans are involved—a simple, one-stage approval, for example—moves at glacial speed for a computer, even if the reviewer needs only minutes to approve the document.
This means that typical performance standards do not apply here. Honestly, for a process executing over the course of a week or a month, does it matter if it takes an extra five seconds or even a few minutes for the process to move from one stage in the process to another?
I don't think so.
What matters is how well the workflow engine can juggle.
Seriously. Think about things this way: For even a moderately busy workflow engine, a dozen defined processes that are being run by a hundred or so people, each process lasting a few days to a few weeks, means a lot of balls are in the air at the same time. A workflow engine must be able to keep all of those moving pieces straight and respond in a reasonable timeframe when any given ball is about to land. Similarly, a workflow engine needs to be able to accept a new process starting at any point in time without dropping any existing processes and without making the user wait an undue amount of time to see that the workflow started successfully.
These requirements mean that simply measuring transactions per second presents an incomplete picture of workflow performance. An equally important measurement is the number of concurrent workflows that the engine can process without dropping any and without causing problems for the environment.
In light of all of this information, we directly measured the following metrics as part of our testing:
Number of workflow starts per second (using two different methods)
Number of tasks completed per second
Number of concurrent workflows
You can read more details on each metric in the section Testing Details.
Like any other subsystem in a complex application such as Windows SharePoint Services, workflow functions well with its default settings, but you can improve and fine-tune performance by modifying specific configuration settings. In an environment that makes heavy use of workflow, you need to pay particular attention to four settings:
Workflow throttle
Workflow batch size
Workflow timeout
Workflow timer interval
A fifth setting, AutoCleanUpDays, is not as important but still bears mentioning. We examine these one at a time in the next few sections, but first you must understand how workflows are processed.
Workflow Processing Overview
Because of their nature, workflows do not process in the way that "typical" applications do. Instead, they have both a synchronous and an asynchronous nature. When they begin running, whether started manually by a user or programmatically by another process, they run synchronously. In other words, the initiating user or process waits until the initial steps of the workflow finish processing before continuing. In the user interface, this is seen as the "spinning gears" page, shown in Figure 1.
This synchronous processing continues until the first "commit point" is reached in the workflow. Although a full discussion of operations batching and commit points is beyond the scope of this paper (see Additional Resources for links to other resources on this topic), suffice it to say that any of the following activities in a workflow signal a commit point:
Any of the Delay activities (Delay, DelayFor, or DelayUntil)
Any of the Onxxx activities, such as OnTaskCreated, OnTaskChanged, or OnWorkflowItemChanged
When the commit point is reached, any workflow events that are queued because of heavy workflow load (see the next section) or delay activities are processed in background jobs. In other words, all remaining work completed by the workflow is handled as discrete tasks that are run as SPTimer jobs. These jobs run independently of other processes that are running in the SharePoint environment.
From a performance perspective, this change means two things:
The only point in our process where transactions per second really matters is from initiation until the first commit point. Because this operation is synchronous when the workflow is started manually, the user or another process is waiting. Therefore, we want the process to return as quickly as possible.
After we transition to background processing (SPTimer jobs), transactions per second is not nearly as important. Instead, the aforementioned process juggling becomes far more critical as the workflow engine needs to manage the background processing.
We revisit these details throughout the rest of this paper in various ways, but most importantly in the Recommendations section.
With this understanding of workflow processing, we can examine the configuration settings with which we can manage our environment.
The workflow throttle setting controls how many workflows can be processing at any one time on the entire server farm. This setting does not control how many workflows can be "In Progress" concurrently, but rather how many can be actively using the processor. When this number is exceeded, workflow instances that are started and events that wake up dehydrated workflows are queued for later processing. The default value is 15. This setting is per farm, so the number of front-end Web servers is irrelevant.
The impact of this setting is that when a workflow starts, the number of currently active workflows is checked. If it exceeds the throttle number, the workflow is not started and instead, a timer job is created to try running the workflow later. If the number of currently active workflows is less than the throttle setting, the workflow is started.
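The throttle decision just described can be modeled as a simple queueing check. The following is a minimal Python sketch of the behavior, not the actual SharePoint internals; the class and member names are assumptions made for illustration.

```python
from collections import deque

class WorkflowThrottle:
    """Toy model of the farm-wide workflow throttle decision."""

    def __init__(self, throttle: int = 15):   # the default throttle value is 15
        self.throttle = throttle
        self.active = set()                   # workflows actively using the processor
        self.timer_queue = deque()            # deferred starts, picked up by the timer service

    def try_start(self, workflow_id: str) -> str:
        if len(self.active) >= self.throttle:
            # Over the limit: do not start now; queue a timer job to retry later.
            self.timer_queue.append(workflow_id)
            return "Starting"                 # the status the UI shows for a queued start
        self.active.add(workflow_id)
        return "In Progress"

engine = WorkflowThrottle(throttle=2)
print(engine.try_start("wf-1"))  # In Progress
print(engine.try_start("wf-2"))  # In Progress
print(engine.try_start("wf-3"))  # Starting (queued for the timer service)
```

The point of the model is the branch: a start request either claims an active slot immediately or falls back to the timer queue, exactly as the throttle check does.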
You can check the current throttle setting by running the following command:
stsadm -o getproperty -pn workflow-eventdelivery-throttle
You can change the throttle setting by running the following command, in which you can replace 25 with the new value:
stsadm -o setproperty -pn workflow-eventdelivery-throttle -pv "25"
The throttle property exists to allow you to control how many resources workflows require in your environment. Because you cannot designate dedicated "workflow servers" the way you can for index or query servers, this mechanism prevents workflows from overrunning your environment.
Throttle is likely the most important setting to get correct in your environment. Unfortunately, there is no magic formula to calculate how you should set this property. The value is highly dependent on your environment and the details of your workflows. If your workflows are all lightweight, you can likely set a high value. (A lightweight workflow performs tasks that do not overstress the server memory or processor and that do not perform an inordinate amount of database operations. Some examples are workflows that simply create and monitor tasks.)
However, if your workflow needs to perform more heavy-duty processing—such as creating sites, iterating through collections, or performing more-intense calculations—avoid setting the throttle value too high.
The best advice is to adjust the setting and then monitor your environment. Continue to adjust and monitor until you reach an acceptable level of performance for both workflows and standard site operations.
Workflows, by their very nature, do not execute in a nonstop, linear fashion. Instead, they run for a little while, pause, run some more, and then pause again, continuing in this manner until the process is complete. Although an outside observer or a developer might disagree, workflows are a collection of batches and the workflow engine is simply a glorified batch controller.
Each workflow is broken into a series of work items, each of which is, generically, an object that represents a single scheduled operation. This operation is at a lower level than, for example, simply "create task"; several individual work items might be associated with that higher level of work. For our purposes, think of a work item as an individual call into the database or a single event to which the workflow must respond (such as wake from a delay or task modification).
Like any other batch controller, one of the tasks of the workflow engine is to moderate the number of work items that are being processed; otherwise, you run the risk of overwhelming the server resources. To that end, the batch size property controls how many work items waiting to be processed by the timer service will be executed in each run. This setting works in conjunction with the throttle property discussed earlier to keep the workflow subsystem from taking over server resources and negatively affecting real-time user interaction.
To complicate things slightly, the batch size setting plays two different roles depending on where we are in the workflow processing lifecycle. (See Workflow Processing Overview for background information on how workflows process.) Essentially, batch size comes into play in two scenarios:
Scenario 1—Immediate Execution When a workflow is started, a work item is created to process the workflow. This happens so that if the server encounters a problem before it starts processing the workflow or before it finishes the initial processing, the workflow is not lost—the timer service picks it up and processes it from the queue. However, because this is a workflow initiation, the work item is immediately moved out of the queue and processed by the W3WP process through the first dehydration point. In this scenario, the batch size setting controls how many work items are moved out of the queue for each workflow instance. Typically, there would be only one work item in this scenario: the one to process the workflow itself.
Scenario 2—Timer Job In this scenario, the workflow timer interval (see Workflow Timer Interval) has expired and the OWSTimer service is looking for workflow items to process. In this case, the batch size setting determines how many work items are processed by the timer service for all running instances. In other words, batch size specifies the parameter that is passed to the database to control how many work items of type "workflow work item" are retrieved, regardless of what workflow instance they belong to.
An example helps to clarify why this property is necessary in addition to the throttle property. Imagine a complex workflow that has multiple branches, creates many tasks (in parallel), and does a significant amount of processing. The throttle setting ensures that no more than 15 (by default) workflows are processing at any one time (remember, not "In Progress" but actively using the processor). However, a complex workflow could spawn hundreds of individual work items. Depending on the nature of the work items, the risk of overtaxing the server resources is very real. This becomes even more likely if you have increased the throttle setting and can therefore have more of these complex workflows running at one time.
To overcome this potential problem, you can get more fine-grained control by setting the batch size property. The default value is 100, meaning that each time the timer service begins processing, the first 100 workflow work items will be processed. As with the throttle property, there is no magic formula for determining the proper value. However, unless your workflows are unusually complex, the default batch size is probably acceptable, and you can get enough control with just the throttle property. If you still see problems, you need to experiment and monitor your environment to know what value is appropriate. You also need to understand the nature of the workflows that are running in your environment to understand whether one or more of the workflows are creating an exorbitant number of work items.
Problems with the batch size setting can be exhibited in one of two ways:
Processing of workflows takes too long, especially when there are time-sensitive operations. This indicates that the batch size is likely set too low. This problem is unlikely because, as we discussed, computers are operating several orders of magnitude faster than their wetware counterparts.
Workflow processing is consuming too many server resources in your farm, and the servicing of user requests or other operations is being negatively affected. This problem indicates that too many work items are being released to the timer service because the batch size is set too high.
The batch size property can be secondary to the throttle property. This fact applies only in Scenario 1 (discussed earlier), where we are dealing with initial workflow starts. If the throttle setting restricts a particular workflow instance from starting (because too many workflow instances are running), no work items for that instance are released, regardless of the batch size setting. In this case, the UI shows a status of "Starting" for the workflow instance. The workflow is run by the timer service when the throttle limit is no longer exceeded.
An important point regarding your throttle setting is that workflows that are being run by the timer service do not count against your throttle limit. From a performance perspective, then, the throttle prevents the W3WP process from being overloaded, and the batch size setting prevents the timer service from being overloaded. Working together, and in conjunction with the workflow timer interval and the workflow timeout (discussed later), these settings help to keep your environment running smoothly while still handling the work necessary for thousands of workflow instances.
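To make the timer-job side of this concrete, here is a minimal Python sketch of how the timer service drains at most batch-size work items per run, regardless of which workflow instance they belong to. The structures and names are illustrative assumptions, not SharePoint internals.

```python
from collections import deque

def run_timer_job(work_item_queue: deque, batch_size: int = 100) -> list:
    """Process at most batch_size pending workflow work items in one timer run."""
    processed = []
    for _ in range(min(batch_size, len(work_item_queue))):
        processed.append(work_item_queue.popleft())
    return processed

# 250 pending work items, default batch size of 100:
queue = deque(f"item-{i}" for i in range(250))
first_run = run_timer_job(queue)    # 100 items processed
second_run = run_timer_job(queue)   # the next 100
third_run = run_timer_job(queue)    # the remaining 50
print(len(first_run), len(second_run), len(third_run))  # 100 100 50
```

Note that three timer runs are needed to drain 250 items at the default batch size, which is why batch size and timer interval together bound workflow throughput.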
You can check the current batch size setting by running the following command:
stsadm -o getproperty -pn workitem-eventdelivery-batchsize
You can change the batch size setting by running the following command, in which you can replace 125 with the new value:
stsadm -o setproperty -pn workitem-eventdelivery-batchsize -pv "125"
The timeout setting specifies the amount of time (in minutes) within which a workflow timer job must complete before it is considered to have stopped responding and is forced to stop processing. Jobs that time out are returned to the queue to be reprocessed later.
The default timeout period is five minutes, which should be sufficient for most environments. However, if your workflows require more time to start, complete tasks, or modify other workflows (especially when running under load), you must increase this property value. Understand, though, that if a workflow instance encounters a problem that causes it to wait for a response (from an external system, for example) before the first commit point, you could encounter throttle issues because that waiting workflow instance is still considered part of the count of your currently running workflows that the throttle property monitors. This condition could prevent other workflow instances from processing.
You can check the current timeout setting by running the following command:
stsadm -o getproperty -pn workflow-eventdelivery-timeout
You can change the timeout setting by running the following command, in which you can replace 10 with the new value:
stsadm -o setproperty -pn workflow-eventdelivery-timeout -pv "10"
Workflow Timer Interval
The workflow timer interval specifies how often the workflow SPTimer job fires to process pending workflow tasks. This interval also represents the granularity of delay timers within your workflow. If a timer is set to delay for one minute, but the interval timer fires only every five minutes, the workflow delays for five minutes, not one minute.
For performance considerations, if your workflow creates a lot of work items, you can use this setting, in conjunction with the batch size, to control the processing of those work items. For example, with a batch size of 100 (the default) and a timer interval of five minutes (the default), Windows SharePoint Services processes at most 100 work items every five minutes. If the batch of 100 work items for one workflow instance finishes processing in two seconds, your workflow instance is sitting idle for 4 minutes and 58 seconds. This may be acceptable; it may not. Decreasing this interval setting allows more batches to process by causing the timer to fire more often and request more work to do; but it also means that workflow processing consumes more server resources.
The minimum value for this setting is 1, which means that the timer will fire every minute.
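Both the granularity effect and the throughput ceiling reduce to simple arithmetic. The following Python sketch is illustrative only; the function names are assumptions, and it models the defaults described above.

```python
import math

def effective_delay(requested_minutes: float, timer_interval_minutes: int = 5) -> int:
    """A delay cannot expire until the next timer run, so the effective delay
    rounds up to the next multiple of the timer interval."""
    return math.ceil(requested_minutes / timer_interval_minutes) * timer_interval_minutes

def max_work_items_per_hour(batch_size: int = 100, timer_interval_minutes: int = 5) -> int:
    """At most one batch of work items is processed per timer run."""
    return batch_size * (60 // timer_interval_minutes)

print(effective_delay(1))          # 5: a 1-minute delay waits for the 5-minute timer
print(effective_delay(7))          # 10: spans two timer intervals
print(max_work_items_per_hour())   # 1200 work items per hour with the defaults
```

With the defaults, the timer fires 12 times per hour and can release at most 100 work items per firing, so 1200 work items per hour is the ceiling regardless of how fast each item completes.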
You can check the current interval setting by running the following command, in which you replace the URL with a valid path to a SharePoint application:
stsadm -o getproperty -pn job-workflow -url http://myWssServer
You can change the interval setting by running the following command, in which value is a valid SPTimer schedule string:
stsadm -o setproperty -pn job-workflow -pv value -url http://myWssServer
You can also specify additional information about the schedule and interval for the timer service by providing a schedule string in the format shown in Table 1. (Other format strings are available, but their applicability to the workflow environment is questionable.)
Table 1. Formats for SPTimer schedule strings
SPTimer schedule string format: Meaning
"Every 10 minutes between 0 and 30": Timer fires every 10 minutes from the top of the hour to half past the hour
"Hourly between 9 and 17": Timer fires every hour from 9 A.M. to 5 P.M.
"Daily at 15:00:00": Timer fires every day at 3 P.M.
"Monthly at 15 15:00:00": Timer fires on the 15th of every month at 3 P.M.
Although the AutoCleanUpDays setting is not directly related to performance, it is important enough to the overall workflow environment to warrant discussion. It is also largely misunderstood and, depending on whom you ask, either does too much or not nearly enough.
The purpose of the AutoCleanUpDays setting is to remove the association between a workflow instance and its history entries and to clear out the workflow instances and tasks. It does not actually remove the entries from the History List. By default, this cleanup happens 60 days after a workflow instance has completed, but that timeframe is configurable.
The primary impact that cleanup has on performance is that the workflow status page loads faster when users browse to it because it has fewer items to query against from the task list. As mentioned previously, cleanup does not actually remove the items from the History List, so it has no impact on the 2000-item limit. A side effect of this process is that the SharePoint content database becomes less cluttered (specifically some of the workflow tables), which may help with performance in very large workflow environments where hundreds of thousands of workflow instances may be or may have been running.
You can set the AutoCleanUpDays value for a particular workflow template by adding the following nodes to the element manifest (typically workflow.xml). Replace xxxx with the appropriate number of days.
<Elements ...>
  <Workflow ...>
    <MetaData ...>
      <AutoCleanUpDays>xxxx</AutoCleanUpDays>
    </MetaData>
  </Workflow>
</Elements>
Every new association based on this template now uses the specified AutoCleanUpDays value. Naturally, this change does not affect existing workflows, the built-in workflows, or SharePoint Designer workflows (which do not use an element manifest file). In these cases, you must run code to change the AutoCleanUpDays property of the SPWorkflowAssociation object. The challenge is that you need to run this code periodically to properly update associations that were added after you ran the code initially.
One final point on this subject is that you should not consider the Workflow History list to be an audit log. If you need that capability, there are other options.
We ran four types of tests to explore workflow performance:
Workflow manual initiation
Task completion
Workflow autostart
Concurrent workflows
Each test exercised a specific area of workflow performance. All tests were performed on a server farm with a separate database server and with between one and eight front-end Web servers. (You can find hardware specifications for each type of machine in the Hardware and Software Configuration section). Details on each test are provided in the following sections.
There are a few important items to know about the testing that was performed:
Tests were run in isolation. No other load was placed on the server. Because workflow processing cannot be segmented to a particular server or servers, this is not a real-world environment. However, our goal was to analyze workflow performance, and we wanted to keep our tests as focused as possible. We did observe overall server performance and resource utilization, which we discuss in the Interpretation of Results section.
Test environments were clean slate. We recreated all test environments before running a new test instance. In this case, this consisted of creating a new SPWeb object and new SPList objects as appropriate for each test, resetting the OWSTimer service, and resetting Internet Information Services (IIS). Again, these actions are not necessarily possible in a production environment. We discuss the impact of not running tabula rasa in the Interpretation of Results section.
Built-in workflow used. We ran all tests by using the built-in Approval workflow. A custom workflow introduces the potential for many variables that would be impossible to track and thus make pinpointing their effects difficult. Our goal was to run tests that could be reproduced on any similar hardware to deliver reliable guidelines.
Throttling eliminated. For all test runs, we set the throttle level to an unnaturally high level—20,000—to eliminate the impact of throttling on the test results. This is not a recommendation for a production environment. As discussed earlier, you must conduct proper testing to determine the appropriate throttle setting for your environment.
Guidelines only. These tests are not intended to be used directly to determine server or farm sizing requirements. They are only guidelines to assist with sizing and planning. Before implementing any enterprise production environment, sufficient testing should be completed with the specific workflows that will run in your environment. There are too many variables in any workflow to produce rules from tests such as these.
Test 1: Workflow Manual Initiation
This test simulated 25 concurrent users initiating a workflow by using the StartWorkflow Web service call. This scenario is the equivalent of users starting workflows from one of the client applications of the 2007 Office system. The goal of this test was to determine performance numbers for pure workflow starts—that is, not including user-interface processing and rendering time.
Each workflow instance created a single task when started and then dehydrated.
The environment for this test consisted of a new, single SPWeb object containing an SPList object that contained 1000 items. Each item had the default Approval workflow started by the StartWorkflow call.
Test 2: Task Completion
This test simulated 150 concurrent users marking tasks as complete by using the AlterToDo Web service call. This scenario is the equivalent of users completing tasks through the client interface of the 2007 Office system. The goal of this test was to determine performance numbers for task completion and workflow teardown.
Each workflow instance had only a single task pending so that when the task was marked as complete, the workflow instance itself was complete as well.
The environment for this test consisted of a new, single SPWeb object containing ten task lists, each containing 1000 tasks created from the built-in Approval workflow. Each task was individually updated by the AlterToDo call. The test ran against the lists in sequence, completing all 1000 items in one list before moving on to the next.
Test 3: Workflow Autostart
This test simulated 150 concurrent users creating list items via the user interface. We configured the SPList object so that the built-in Approval workflow would start automatically any time a new item was created in the list. The goal of this test was to determine the performance of workflows that started automatically via event receivers.
Each workflow instance created a single task when started and then dehydrated.
The environment for this test consisted of a new single SPWeb object containing an SPList object. The test created new items in that list and had the workflows start automatically.
This test also included the overhead of creating the new item (upon which the workflow runs) inside the SharePoint list.
Test 4: Concurrent Workflows
This test was run slightly differently from the others. In this case, we programmatically created more than 25,000 list items—1000 in each of more than 25 separate lists in a single SPWeb object. Each of the lists had the built-in Approval workflow set to start automatically on item creation. For each group of 1000 items, we created a separate task and workflow history list for the workflow that would run against them.
After all the items were created and all workflows marked as "In Progress," we reran the workflow autostart test and compared the results from these tests to the previous runs, which did not have a large number of In Progress workflows.
Before getting to an interpretation of the results, the following general statement can be made about the first three tests:
Performance increased in a nearly linear fashion from one to four front-end Web servers and then leveled off or grew at a significantly reduced rate from five to eight front-end Web servers.
This result implies that workflow performance across a given number of front-end Web servers roughly tracks general Windows SharePoint Services performance (see Estimate performance and capacity requirements for Windows SharePoint Services collaboration environments (Office SharePoint Server)). That is logical, because workflow is a Windows SharePoint Services subsystem that is tightly integrated with its owning application. The good news is that anything you do to improve overall Windows SharePoint Services performance similarly improves workflow performance.
Interpretation of Results
Overall, results were as expected from a performance perspective and in line with other performance testing done against Windows SharePoint Services. Windows SharePoint Services proved itself capable of handling a large volume of workflows concurrently. As alluded to above, the 4x1 farm proved to be the optimum configuration with regard to performance for the hardware invested. On a 4x1 farm, we were able to achieve the following results for 150 concurrent users:
18 workflows automatically started (at item creation) per second.
45 workflows started per second via Web service calls, such as from an Office client application. (This test was run for 25 concurrent users.)
9 workflow tasks marked complete (thereby completing their workflows) per second via Web service calls, such as from an Office client application.
These results extrapolate to almost 65,000 workflows starting per hour (using the lower of the two start rates), which is certainly an impressive number. We would love to be able to claim this level of performance. However, please do not take this extrapolation to be a statement of fact and expect that you can achieve these numbers. You won't. Well before you reach this level of performance, you will encounter other constraints in Windows SharePoint Services that prevent you from approaching it.
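The extrapolation above is simple arithmetic over the per-second rates quoted from the tests, as this short sketch shows:

```python
# Reproduce the back-of-the-envelope extrapolation from the per-second
# test results quoted above. These are lab throughput ceilings, not
# numbers to expect in production.
rates_per_second = {
    "autostart (150 users)": 18,        # workflows started automatically
    "web-service start (25 users)": 45, # workflows started via StartWorkflow
    "task completion (150 users)": 9,   # tasks/workflows completed
}

# "Almost 65,000 workflows starting per hour" uses the lower of the two
# start rates (18/sec), not the task-completion rate.
lowest_start_rate = min(rates_per_second["autostart (150 users)"],
                        rates_per_second["web-service start (25 users)"])
per_hour = lowest_start_rate * 60 * 60
print(per_hour)  # 64800, i.e. "almost 65,000"
```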
The message to take from our performance testing is that workflow in Windows SharePoint Services is capable of meeting enterprise needs, but your specific performance is highly dependent on the details of your environment and your workflows themselves. Therefore, testing and benchmarking are critical elements of any enterprise workflow implementation.
Achieving optimum results for your specific environment is a matter of design and tuning. For information about designing high-performance workflows, see the Recommendations section. For details about tuning, see Important Workflow Configuration Settings.
Overall Resource Utilization
Across all tests, we monitored resource utilization for the farm environment as a whole. Nothing concerning or unexpected was reported for any test run. For example, during our preparation we ran several tests with the workflow throttle setting at its default value of 15. During these runs, processor utilization on the database server and the front-end Web servers each peaked at approximately 10 percent. During the regular test runs, with throttling effectively removed (set to 20,000), processor utilization increased to a peak of approximately 25 percent.
As mentioned earlier, we recreated our test environment between all official test runs to start from a clean slate. This ensured that test results across different farm configurations within a test set could be successfully compared to each other to develop a trend. For our tests, this typically consisted of creating a new SPWeb object and any necessary supporting lists, list items, or workflow associations, plus resetting the OWSTimer service and IIS prior to each test run.
We understand that this procedure is not always possible in a production environment. To assess the impact of running a large volume of workflows repeatedly in a non–clean slate environment, we ran some additional testing. Average degradation in this case was approximately 8–10 percent. Interestingly, the only test that saw this level of degradation was the workflow manual initiation test. The other two tests were unaffected. This level was maintained regardless of how many times the tests were repeated.
The following sections provide the details of the results we obtained from our tests.
Test 1: Workflow Manual Initiation
The workflow manual initiation test was run twice per farm configuration, and the average of the two runs was used as the result. Table 2 and Figure 2 show the results. As you can see, 25 concurrent users were able to initiate 26 workflows per second from a simulated Office client on a 1x1 farm. This number increased in a linear fashion through the 4x1 farm at 45 workflow starts per second and then leveled off. The change from a 4x1 farm through the subsequent farm configurations to an 8x1 was negligible.
Workflow starts per second (average)
Test 2: Task Completion
The task completion test results showed a curve similar to the previous test results. On a 1x1 farm, 150 simulated concurrent users completed three tasks per second. Results increased linearly through the 4x1 farm, slowed considerably for the 5x1 farm, and remained essentially flat through the 8x1 farm.
This test was run twice per farm configuration, and the average of the two runs was used as the result.
Tasks completed per second (average)
Test 3: Workflow Autostart
The workflow autostart test was run twice per farm configuration, and the average of the two runs was used as the result. As before, results increased in a linear fashion from the 1x1 farm to the 4x1 farm (increasing from 8 to 18 workflows started automatically), leveled off at 5x1 and 6x1, and then dropped slightly to 17 workflows started automatically for both 7x1 and 8x1.
Workflow autostarts per second (average)
Test 4: Concurrent Workflows
Testing the impact of the concurrent-workflow scenario was handled slightly differently from the others because the potential impact of a large volume of workflow instances is revealed only over time. In this case, we started 26,992 workflows (by creating the items and having the workflow autostart) and then let the farm sit untouched overnight. In the morning, we manually browsed the site looking for any signs of performance degradation. We also re-ran the workflow autostart test and compared the results to previous runs.
In our casual browsing of the site, we saw no performance degradation. The results from our rerun of the autostart test averaged a 2-percent degradation in performance (ranging from a 0.25-percent improvement to a 7.5-percent degradation).
We ran this test only on a 1x1 farm. However, the nature of workflow processing leads us to believe that performance on other farm configurations would deliver similar results.
The hardware and software of the farm servers used for these tests include the following:
Database server:
Sixteen 3.2-GHz, 64-bit processors
Two Gigabit Ethernet network adapters
Microsoft SQL Server 2005 SP2 (32 bit)
Front-end Web servers:
Eight 2.33-GHz, 64-bit processors
Microsoft Office SharePoint Server (32 bit)
Running workflows in Windows SharePoint Services is easy. The built-in workflows can be used as-is in many situations to meet business needs. When those workflows do not meet your needs, you can create custom workflows in either Visual Studio or SharePoint Designer that are reasonably simple to develop.
Complexity occurs when you begin to push your workflow environment towards enterprise scale—managing thousands of concurrent processes across multiple documents, list items, sites, and site collections. When you move into this space, you must take some specific steps to achieve maximum performance.
Some of these tasks are required regardless of whether you are running custom or built-in workflows. They apply to any workflow and focus on the environment rather than the workflow itself; because they do not involve altering the workflow's core processing, they require no developer knowledge to implement.
Any heavily used workflow environment should do the following:
Recycle the OWSTimer service periodically. A general recommendation is to recycle the timer service approximately every four hours in a typical SharePoint farm. If your server farm uses workflow heavily, you might need to increase this frequency. Workflows make heavy use of timer jobs, so recycling the timer service more often can provide increased performance. How often? Well, that depends on the specifics of your server farm, but certainly every three hours, two hours, or one hour is not unreasonable. Timer jobs are not lost when the service is restarted; at worst, they are delayed a few extra seconds while the service restarts.
Be cognizant of task and workflow history list usage. Remember, these are SharePoint lists and have their own recommendations for scaling—the most important of which is to keep the number of items in a "container" (the folder or root of a list) below 2000. When you exceed this number, performance begins to suffer. Typically, this recommendation applies to reading from lists, but in our testing we saw performance degradation when writing to lists that contained approximately 5000 items. For workflow, this means that you should consider creating new task and history lists for every workflow association. Doing so does not totally eliminate the problem if a workflow writes to the history list often or creates an excessive number of tasks, but it will help. You should also monitor list sizes and consider trimming lists where appropriate. For the task list, some of this trimming is handled by the AutoCleanUp job (discussed earlier), but it is not complete and may not be sufficient. See the next portion of this section for our recommendations on getting the best performance out of custom workflows.
Repeatedly test and tune. Earlier in this article we discussed four settings that affect your workflow environment and its performance. Throughout this article, we have mentioned that you need to take the time to get these settings correct for your environment. Consider this a reminder of that fact. Performance tuning for any environment, not just workflow, is not something you do only one time. You need to determine optimum settings for each configuration option when the system is first brought online and then revisit these settings periodically—especially before new workflows are brought online or significant increases in existing workflows are made.
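The container-size guidance above can be enforced mechanically by rolling over to a fresh task or history list before the current one fills up. The following Python sketch is purely illustrative (the helper and naming scheme are hypothetical, not a SharePoint API); in a real farm the item counts would come from the SharePoint object model or Web services.

```python
# Hypothetical sketch: choose (or roll over to) a task-list name so that no
# single list grows past the ~2000-item container guidance discussed above.
# Item counts are supplied by the caller; in practice you would query them
# from SharePoint.

CONTAINER_LIMIT = 2000  # guidance from this article, not a hard API limit

def choose_task_list(base_name: str, item_counts: dict) -> str:
    """Return the first list in the series base_name, base_name-2, ...
    that still has room under CONTAINER_LIMIT; mint a new name if none do."""
    suffix = 1
    while True:
        name = base_name if suffix == 1 else f"{base_name}-{suffix}"
        if item_counts.get(name, 0) < CONTAINER_LIMIT:
            return name
        suffix += 1

# Example: the first list is full, so the second one is selected.
counts = {"Approval Tasks": 2000, "Approval Tasks-2": 1500}
print(choose_task_list("Approval Tasks", counts))  # Approval Tasks-2
```

A scheme like this is most useful when applied at association time, because each workflow association can be pointed at its own task and history lists.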
For custom workflows, you have more control over the actual workflow and the process it follows, so we have a few additional recommendations. Note that these are targeted mostly at workflows that are created in Visual Studio because you cannot affect many of these items when you create workflows in SharePoint Designer.
Custom Visual Studio workflows should do the following:
Keep the time from initiation to the first dehydration as short as possible. This is the period during which users who start a workflow manually through the browser UI watch the "spinning gears" on the Operation in Progress page, so the shorter it is, the faster your workflow will seem. Perception is reality when users are waiting for results. In addition, because work items run by the timer service do not count against the throttle limit, the sooner work shifts to background processing, the less time it counts against your throttle limit.
Be cognizant of task and workflow history list usage. We mentioned this item earlier, but now it takes on a slightly different meaning. First, when building a custom workflow, you should either use a logging mechanism other than the default history list or, at a minimum, significantly reduce the amount of information you write there. This could take many forms:
Write any information you need to the history list while developing and debugging, but far less information when running a release build.
Log information elsewhere, such as in the event log or the ULS log. A good guideline is to write to the History List only information that is of value to the user of the workflow. Debugging or monitoring information should be written elsewhere.
Second, your workflow should be aware of how many tasks it creates and look for ways to reduce that number. Reducing the task count may not be possible, depending on the specifications for your process, but it should be investigated to ensure that the process is optimized.
Dispose Properly. This recommendation is not unique to workflow development, but it is critically important, especially in a process that may be running hundreds or thousands of times concurrently: Properly dispose of your SPWeb and SPSite objects. A full discussion of this requirement is not germane to this paper, so refer to the Additional Resources section for links to information about proper disposal practices.
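The SharePoint object model is .NET, where the disposal fix is wrapping SPSite and SPWeb objects in using blocks. Purely as an illustration of the underlying pattern (the Site class and open_site helper below are hypothetical stand-ins, not a SharePoint API), here is the same deterministic-disposal idea sketched with a Python context manager:

```python
# Illustrative only: the deterministic-disposal pattern that SPSite/SPWeb
# require in .NET (via C# using blocks), sketched with a Python context
# manager. Site is a hypothetical stand-in for an unmanaged-resource holder.
from contextlib import contextmanager

class Site:
    """Stand-in for an object holding unmanaged resources, such as SPSite."""
    def __init__(self, url: str):
        self.url = url
        self.disposed = False
    def dispose(self) -> None:
        self.disposed = True

@contextmanager
def open_site(url: str):
    site = Site(url)
    try:
        yield site
    finally:
        # Runs even if the body raises, mirroring C#'s using statement.
        site.dispose()

with open_site("http://server/site") as s:
    leaked = s  # keep a reference so disposal can be observed afterward
print(leaked.disposed)  # True
```

The point of the pattern is that disposal happens on every code path, including exceptions; in a process running thousands of concurrent instances, any path that skips disposal leaks memory on the front-end Web servers.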
We covered a lot of ground in this article. By now, you should realize that SharePoint workflows can scale to run tens of thousands of processes concurrently when properly designed in a properly configured environment. There are specific guidelines to follow and specific settings that need to be managed to achieve this end, but none of them are difficult to implement.
The tests that we performed involved specific scenarios, but ones that are quite typical for workflow. Although your specific workflow is likely different from the built-in Approval workflow we tested, you can use our results as guidelines for the type of performance you can expect. Similarly, you can use the recommendations presented here to achieve the maximum performance in your specific environment and for your specific workflows.
One final recommendation to take away is that workflows are applications that live inside Windows SharePoint Services. You need to think about them in that way if you want to achieve the levels of performance necessary for enterprise environments. Poorly designed workflows or environments can significantly hamper performance and prevent you from achieving your goals. Above all else, test, test, test. It is critically important that workflows be treated like applications, and that includes performance testing in your environment.
For more information, see the following resources:
Properly disposing of SharePoint objects:
Batching and commit points in SharePoint workflow activity operations: How Windows SharePoint Services Processes Workflow Activities
Overall Windows SharePoint Services performance: Estimate performance and capacity requirements for Windows SharePoint Services collaboration environments
Blogs that cover workflow: