Benchmarking a Transaction Engine Design


Richard Drayton

Arvindra Sehmi
Microsoft EMEA

January 2005

Summary: Performance is probably one of the least understood and misquoted metrics in the field of computing today. It is common practice amongst technologists and application vendors to regard performance as an issue that can safely be left to a tuning exercise performed at the end of a project or system implementation. In contrast, performance, as a metric for system evaluation, is considered by most users to be one of the most important and indeed critical factors for assessing the suitability of a system for a particular purpose. This paper describes a benchmarking exercise undertaken during October 2002 at Microsoft's ISV Laboratories (ISV Labs) in Redmond, Washington as part of joint project between the Capital Markets Company (Capco) and Microsoft. (25 printed pages)


The Processing Model
The Operational Benchmark
Benchmark Results
Comments and Conclusions
About the authors


Performance is probably one of the least understood and misquoted metrics in the field of computing today. It is common practice amongst technologists and application vendors to regard performance as an issue that can safely be left to a tuning exercise performed at the end of a project or system implementation. In contrast, performance, as a metric for system evaluation, is considered by most users to be one of the most important and indeed critical factors for assessing the suitability of a system for a particular purpose. This paper describes a benchmarking exercise undertaken during October 2002 at Microsoft's ISV Laboratories (ISV Labs) in Redmond, Washington as part of joint project between the Capital Markets Company (Capco) and Microsoft.

The project was initiated when Capco was commissioned by the Singapore Exchange Limited (SGX) to provide a business assessment and to develop a technical architecture for the exchange's centralized transaction processing utility. The utility was to provide matching services for post-trade, pre-settlement interactions of participants in the equities and fixed income trading areas of the Singapore market. The design of the main processing engine was done following a process known as Software Performance Engineering (SPE) [SMITH90, SMWIL02] in which the entire design and validation exercise is modeled from a performance perspective rather than from a traditional object-oriented design perspective.

Two subsystems were created by Capco, known respectively as STP Bridge (a communications infrastructure and exchange gateway) and STE (a scalable transaction engine). Both were used in the benchmarking exercise.

The Processing Model

The architecture of STE was based on loosely coupled, stateless, message processing components arranged in a queuing network for high scalability, high performance, and extremely high transaction throughputs. For SGX, a worst case transaction processing load of approximately 600 messages per second was estimated by analysis of their trading history over the previous few years; which turned out to be the highest processing level required during the previous Asian financial crisis in 1998 when exchange dealings were considered abnormally high. This value was used as the baseline/target processing load. A stretch load target of 2000 messages per second was set to ensure the architecture, if successful, would have adequate headroom to cope with future expected trading volume growth and volatility.


Figure 1.   Processing Engine Model

The decision to use loosely coupled components communicating through message queues rather than through more traditional component interfaces (APIs) requires information to be passed from component to component either directly through the messages themselves (persistent message flow) or by enriching the messages as they pass through various processing steps (temporal message flow). The processing components themselves are largely independent of each other and stateless. This results in benefits such as lower software development risk and realization costs for individual components together with higher scalability and flexibility characteristics of the processing engine as a whole when compared to traditional monolithic application development approaches. Most conventional designs support only one of the two possible scalability dimensions: scale-up —increased processing power through increased processor resources (memory, CPU, etc.), or scale-out —increased processing power through increased number of processing nodes. This architecture supports both types of scaling.

The overall architecture of the STE processing engine is shown in Figure 1.

This consists of a number of STE components which have responsibility for processing subareas of the underlying trading business activities. The business process support provided by the engine is realized by breaking down the entire trade lifecycle into a set of related atomic messages. Each market participant generates and receives various subsets of these atomic messages during execution of the trade.

An implication of the architecture is that the business process itself must be capable of being represented by fully asynchronous commutative set of operations, that is, it must be able to process messages in any order. This removes the necessity to synchronize message processing and business operations throughout the engine, a task which would result in an incredibly slow and complex application. Note that synchronization is different from correlation of messages which is applied in normal processing. Several other "autonomous computing" requirements are catered for in the processing model. Amongst these are the notions of distrusting systems, idempotent operations, state management, message correlation, context management, tentative operations, message cancellation, and transaction compensation1.

Business-level message flow through the processing engine is based, in part, on the execution model proposed by the Global Straight Through Processing Association (GSTPA), which proposed a similar centralized utility model for cross-border trade settlement. The SGX processing model was likely to require operational links to other centralized utilities like the GSTPA, so the message sets used were based, in part, on those used by the GSTPA to help with inter-operability in the future2.

The business process for the Singapore market reduced to a set of four principal message types, namely, Notice of Execution (NOE), Trade Allocation (ALC), Net Proceeds Allocation (NPR) and Settlement Instructions (SET). The process surrounding these message types involved a number of interactions leading to a set of some 35 message variants that formed the entire business operating model. As recent events in the financial services industry illustrated, the range of processing volumes that could be experienced by the utility would be comparatively large. For this reason a heavy focus was placed on the ability of the architecture to support high scalability requirements.

Modeling Performance

A valid benchmarking exercise allows other institutions to repeat the benchmark for themselves and to be able to achieve similar results. It was also important to support the arguments for the architecture model through sound mathematical techniques which would enable an assessment of the impact on performance of various implementation realization techniques a-priori to deciding on any specific realization technologies. Figure 2 shows the queuing network model used for the performance analysis in the benchmark. Note, this works hand-in-hand with a client-side Participant Access Module (PAM). By conducting the benchmark in this way it was felt that the results of the exercise, good or bad, would be credible and valid and that the results would serve as the basis for a case study into the application of queuing network models for the processing of Post-Trade, Pre-Settlement information as part of an overall high performance straight through processing (STP) initiative in the financial services industry.

Measuring Performance

One of the most frustrating aspects of performance engineering occurs when a performance unit which permits valid comparisons to be made between similar systems is not identified. Performance is a subjective quantity whose absolute value is often determined solely by the user of a system and not by any systematic algorithm or process. This nebulous aspect of performance gives rise to variations in the perceived value of performance characteristics for a specific system even without accompanying physical changes to the underlying technology whatsoever. The results of any benchmark are subject to wide interpretation and this can jeopardize any comparative analysis of software systems. Unrealizable performance is a common characteristic of benchmarks that has little relevance to a user's perspective of performance. The existence of unrealizable performance can be shown in numerous examples of published benchmarks where results often indicate high levels of performance that are simply unobtainable in the real world. For example, an ADSL connection offers a theoretical download speed of 512 K, but in reality is limited by contention with other users on the same exchange. So, a more realistic way to do performance comparisons between systems is to establish metrics for an operational situation which can be readily reproduced outside the test environment and is meaningful from the system user's perspective. To avoid these sorts of frustrations with the benchmarking process, a practical technique was required which would provide a supportable and reproducible figure for the performance of the realized system. The technique chosen was based on Buzen and Denning's work [BDEN78] allowing credible performance figures to be obtained based upon sound mathematical principles.

The Operational Benchmark

The purpose of the benchmark was to validate the architectural design on Microsoft platform technologies and to establish the credibility of the benchmark in an environment highly correlated with the operational requirements that would be presented to potential system users. This goal could only be achieved if the benchmark tests were firstly performed using volumes and process loading levels consistent with those that would be experienced in the final realized utility, and secondly performed in a manner that could be reproduced on-site as well as in the laboratory. For the laboratory exercise a set of test management components were created which allowed an end-to-end analysis to be performed. These consisted of a scalable message driver and corresponding scalable message sink with performance measurements being taken from these two points. The performance figures were calculated on the time taken to fully process a known number of messages forming a known number of financial trades.


Figure 2.   Queuing Network Model

The Buzen and Denning Method

Buzen and Denning [BDEN78] described a practical technique for evaluating the performance of queuing network software systems. In essence, they argued that a simple and easily obtainable measurement of queue length over time could be used to establish all the performance metrics which they had defined for queuing networks. A realization of the Buzen and Denning technique produces an execution graph for the network model similar to that shown in Figure 3.

A simple queuing model can be viewed as a queue together with an associated server, as shown in Figure 3 where messages arrive (arrivals) on an in-bound queue on the left-hand side, are processed by the server, and exit (completions) on the right-hand side. The Buzen and Denning technique involves sampling the queue length of the server at regular intervals and recording the number of messages remaining in the queue and continuing the observations over a period of time.

The resulting graph of observations is known as the execution graph and clearly shows the arrivals (i.e., increases in the height of the graph) and completions (i.e., decreases in the height of the graph) of messages. Using these figures, Buzen and Denning derived a set of formulae from an original result established by J.D.C. Little (known as Little's Law [LITTL61]) which enabled all of the required performance metrics to be determined. A summary of the formulae is shown in Table 1.

Table 1.   Buzen and Denning Formulae


Measuring Performance

To complete the performance measurement a standard set of business messages was needed representing the mean business process life cycle for the utility. A standard "trade" was created using an NOE (notice of execution) message, two ALC (trade allocation) messages, two NPR (net proceeds) messages and two SET (settlement instruction) messages. A standard trade therefore consisted of 7 base messages which together with the requisite acknowledgement and confirmation messages between participants comprised the entire information set for the tests. The completion of a trade's processing by the system generated one further message which was designated the CLS (clearing and settlement) message.

In the GSTPA specification a combination of up to seventeen different elements in the messages were required to match before the entire trade could be considered valid. The requirements differed depending on the particular message types being matched but at some point all seventeen matching criteria had to be applied for the trade to be processed. The full set of matching criteria is shown in Table 2 for reference.

Table 2.   Message Matching Criteria



Figure 3.   Simple Execution Graph

The matching process entailed scanning the message database matching table for all elements of the trade specified in each message as each message was received. When the scan yielded the required seven matching atomic messages a valid CLS message was sent to the out-bound queue of the test system. In addition to the matching process, the business process for the Singapore market also requires validation of the contents of the messages for items such as currency and country designations, security reference codes, payment and settlement dates and participant identification information. A collection of static data was created to form the base reference data for the utility and is shown in Table 3 for reference.

Table 3.   Static Data Parameters


To establish the benchmark result test message sets were generated using the static data. A message builder component was used to randomly select appropriate static data and then combine it into the standard trade of seven independent messages. Two core message sets were created; the first contained blocks of 250,000 trades (1,750,000 messages) and the second contained 1,000,000 trades (7,000,000 messages). A normal business cycle would require all matched or unmatched trades existing in the system after 3 days to be cleared from the system database; however, it was decided to keep all processed trade information in to investigate the degradation in database performance as volumes increased.

The message sets were stored in standard text files in XML format ready for transmission into the processing engine. To monitor the benchmark and obtain the required performance metrics a set of management components was constructed. These were a message driver component, a message sink component, and an additional component to monitor the queue lengths of both the in-bound and out-bound message queues as well as the internal queue lengths at regular intervals. The message driver component processes the test message set files sequentially and applies a digital signature prior to sending them to the in-bound queue of the processing engine.

Each message driver running on its own without contention with other processes in the system can achieve in the region of 8000 messages per second. This figure is very close to the benchmark results provided by Microsoft for the MSMQ product. A message sink component reads the information destined for the CLS queue and monitors the time taken to process a given number of messages through the system. The monitoring components are shown in Figure 4.


Figure 4.   Benchmark Monitoring Arrangement

Hardware Environment

The nature of the architecture proposed for the processing utility lends itself to higher performance where multiple machines are concerned. To remove resource contention issues within the overall architecture, it is better to have multiple single CPU machines than a single multiple CPU machine. The hardware used was 4 instances of the set of machines shown in Table 4.

Table 4.   Basic Hardware Environment (x4)


To scale-out processing during the benchmark we simply deployed multiple copies of this basic hardware environment. In addition to the hardware listed in Table 4, a further 8 single CPU machines hosted the message drivers, monitor and message sink components used to record the benchmark results. The database was put on Compaq MSA 1000 RAID (Redundant Array of Inexpensive Disks) storage device configured for level 0 support (maximum throughput, minimum recoverability). Since the business scenario for the trading exchange utility required local (i.e. client-side) database support for each participant connected to the utility, disaster recovery sites, and on-line dual system redundancy, the loss of RAID recoverability was considered a small price to pay compared to the gain in performance provided by a RAID level 0 configuration. We initially thought a single database storage device could manage the full trade transaction volumes. But it soon became apparent during execution of the benchmark tests that internal disk queuing reached excessive levels. We'll see later how this problem was avoided.

Software Environment

The software operating environment used for the various hardware platforms and support software components is listed in Table 5. Although the RTM release of Windows Server 2003 had not occurred at the time of the benchmark, the release candidate (RC1) was available and was considered to be sufficiently stable and complete to be a valid part of the benchmark. All applicable service packs were applied to the operating environment including any third party drivers used for peripheral devices.

Table 5.   Software Operating Environment Data Item Re


Scaling the Architecture

For the benchmark tests the scale-up model involved executing the software components on 4 CPU machines (an increase of 2 over the basic processing node) and also executing the code on the 64-bit Itanium processors. Although the Itanium machines contained only 2 processors, the available bus and I/O bandwidth was considerably higher than the standard 32-bit platforms and the results obtained were certainly encouraging. We did not have time to investigate the scale-up model thoroughly. The scale-out model employed was to increase the number of processing nodes from 1 through 8 processing engines and from 1 through 4 database processors. We spent the bulk of our time investigating the scale-out model.

Software Processing

The STE engine components were written in C/C++. They process XML messages using an exception-based heuristic in which messages are assumed to be correct for both content and format until such time as an error is detected (missing or incorrect data). On exception, the message in question is directed to a nominated exception queue for further processing. The subsequent exception resolution processes were not included in the scope of the benchmark tests. Validation of the content of each message was carried out against static data which had been pre-loaded into memory-based arrays. The validation process involved organizing the required static data into sorted arrays, using the C/C++ qsort function, and validating the existence of XML elements by locating an entry within the array, using the C/C++ bsearch function. Data elements in the XML messages are accessed using the standard C/C++ strstr function.

Benchmark Results

The benchmark tests produced some interesting results. Some validated the application design, while others led to architectural changes to address identified performance issues. The basic lesson we learned was that you generally need to "tune for percentages and re-architect for orders of magnitude improvements in performance."

Message Queue Processing

The benchmark test was conducted at two basic levels, the first having 250,000 trades (1.75 million messages) and the second having 1,000,000 trades (7 million messages). All processed information was left in the database as an added processing load on the system. The arrangement shown in Figure 4 was used as the basis for the benchmark evaluation. A significant latency time was noted when the message driver components were first started due mainly to contention issues within the individual queue managers processing the message streams.

The insert process rate was so high that the individual queue manager processes had insufficient time during the start of each run to transfer messages across the network to the remote processing machines. This gave the effect of the initial transfers taking several seconds before the performance monitors picked up any activity on the processing nodes of the system. Figure 5 explains this queue manager contention process. In practice it is extremely unlikely that several million messages would arrive as a single input into the utility and therefore the latency effect could be ignored for the purposes of the benchmark evaluation.


Figure 5.   Single Queue Manager Contention Process

This effect would be common to all asynchronous message transfer systems, not just MSMQ, which was used for the benchmark. A similar effect could be observed in the processing components when large numbers of messages would arrive in their in-bound queue (remote host message buffer) as a single input. This meant that the processing components would be pushed to approximately 100% loading during the initial start of a run and would stabilize to a lower figure once flow through the machines had evened out.


Figure 6.   Processor Time for a Typical Component

A typical performance monitor output for a single processing node is shown in Figure 6. The effect of the initial message burst can be clearly seen in the response curve for the combined dual processor machine.

Furthermore, the effect of processing completion of the in-bound messages flowing into the queue manager message buffer can be seen in the later part of the response graph for the host machine. Here, increasing resources (or more accurately reducing contention) would allow an increased use of processor power in the latter stages of the processing cycle. To counter both these effects of hitting message queue buffer limits too quickly and uneven processor utilization during the message injection phase, we added a larger number of in-bound queue processes and in-bound queues. This enabled the in-bound message load to be aggregated over more resources thus reducing the latency times involved.

Perhaps the biggest difference was made by using MSMQ 3.0 instead of MSMQ 2.0. The former has a 4 GB memory buffer size limit before a new buffer allocation is made, which is three orders of magnitude above the 4 MB buffer size in MSMQ 2.03.

Multi-node Processing

In the original designs of STE a single database was used. As the number of processing nodes (dual processor machines running the component software) increased, a distinct drop in the overall processing rate was noticed. This drop in processing throughput is shown in Figure 7 and was caused by contention within the database component. The cause of the contention was not due to operating system or component software problems but due to excessive disk queuing occurring in the RAID array. This meant that available bandwidth for transferring information onto the disks of the RAID array was insufficient to meet the demands made by STE's software elements.

This effect is most easily seen when examining the insertion rate of information into the database. The performance graph for the single database server is shown in Figure 8. Here, the corresponding performance graph to the one shown in Figure 7 shows the dramatic reduction in inserts into the database cause by disk queuing in the RAID array as the number of processing nodes increases. The contention for available resources caused by this queuing means that the system couldn't reasonably cope with more than 2 processing nodes in the original design. The next section discusses how this issue was overcome.


Figure 7.   Process Rate per Node


Figure 8.   Process Rate per Node (Single Database Engine)

Server Hashing Algorithm

After discussions with the Microsoft SQL Server team regarding techniques for improving the available bandwidth for disk operations, the use of a hashing algorithm was proposed, which would enable multiple database serves to be incorporated into the overall solution. The purpose of the hashing algorithm is to use a unique key on the trade derived from the message data, and a corresponding hash function which produces a single numeric value resolving to a unique instance of a database server.

A hashing algorithm was chosen to reflect our business need to always direct constituent messages of specific trades to the same database server. For the benchmark the key consisted of a subset of the matching criteria defined in Table 2. The key was constructed by concatenating a chosen subset of matching criteria values and converted to a single, very long integer number. This number was then divided by a binary value representing the number of proposed database servers (or instances) as shown in the following formula:


Using this formula4 we very effectively improved performance by federating multiple database servers (or instances) with a hash. The infrastructure architecture for the proposed utility was revised to reflect the inclusion of the multiple database solution using the hashing algorithm as shown in Figure 9. For ongoing benchmark testing support for up to 16 database servers was made within the software components, however, the system was tested to a maximum of 4 such servers. Repeating the tests using the hashing algorithm to distribute the database load across four database servers yielded very impressive results.

Click here for larger image

Figure 9.   Final Hardware Infrastructure Arrangement (click to enlarge)

The Little's Law Curve

The performance metrics determined by Buzen and Denning are based on a fundamental result produced by J.D.C. Little [LITTL61]. The generalized performance characteristic discovered by Little is shown in Figure 10.


Figure 10.   Generalized Little's Law Curve

For any queuing model, when processing a specific number of tasks the response time increases as the arrival rate (and for a balanced queuing model the completion rate) increases. The curve is characterized by having an almost linear portion in the early stages getting progressively more asymptotic as the input (and completion rate) increases.


Figure 11.   Linear Scaling with Multiple Database Instances (Single Database Engine)

The first observable results, Figure 11, showed that the increase of 400% in the available database bandwidth placed the system in the linear portion of the performance graph producing an almost linear response characteristic when the processing components are scaled from 1 to 4 nodes. In fact the measured results showed an extremely linear scaling between 1 and 4 processing nodes with only a minimal divergence from the linear model being observable. However, if the input rate is increased (in this case by increasing the number of processing nodes to 8) a divergence from the linear scaling case can be observed. This measured effect is shown in Figure 12.


Figure 12.   Scalability with Eight Processing Nodes Performance Curve

Using measured results the Little's Law curve can be drawn for the test queuing network model as shown in Figure 13. The result shows that the operational performance of the queuing model will suffer increasing reduction in performance as the number of processing nodes increases past 4 components with a significant reduction at 8 processing components.

At this point it is worth noting the scale on the left hand side of the graph in Figure 12 showing the throughput rate of the entire STE queuing model measured at some 7734 messages per second. Clearly the next scaling option to use would be to increase the number of database servers to 8 (i.e., the next available binary multiple). With this we would reasonably expect to see the message processing throughput rate reach in excess of 15,000 messages per second given a suitable increase in the number of processing nodes used.


Figure 13: The Measured Little's Law Curve

Degradation with Stored Volume

At the measured processing rate the queuing network would achieve a sustained rate in excess of 27,842,400 messages per hour or 222,739,200 per operating day. It is reasonable to ask if such a high processing rate measured over several minutes could be sustained over time. To determine the characteristics of the model as stored volumes increase, the processing load of some 2,000,000 messages was used as a base figure and a subsequent processing run of 7,000,000 messages was used to determine the effect on the overall performance as the database volumes increased. The measured degradation in message throughput with volume is shown in Figure 14.


Figure 14.   Database Degradation with Volume

Here the processing rate dropped to some 5,500 messages per second as the volume processed reached the 7,000,000 messages target. Even at this extreme level, the queuing model was achieving some 19,800,000 message per hour or 158,400,000 per operating day. The granularity of the result did not permit a more accurate measurement of the degradation effect other than the linear approximation shown here. If a finer granularity observation had been made the degradation rate would have been seen as a curve rather than a straight line indicating that a reduced degradation effect would be experienced as volumes increased further (possibly a characteristic of the paging schemas used for the B-Tree structure of modern RDBMSs).

The 7,000,000 messages processed during this test represented 1,000,000 trades processed in a relatively short period of time. It is worth noting that there are many examples of existing transaction engines in the financial services industry that have failed to reach this operating level using technology rated to a higher performance level than the Windows and Intel-based machines used in these tests.

The Buzen and Denning Results

To determine the performance of the individual components the Buzen and Denning metrics need to be determined. The monitoring process measured the length of each of the processing queues used for the queuing network model and the performance metrics were calculated. A sample result from the calculation process is shown in Table 6.

Table 6.   Sample Buzen and Denning Results Calculation


The sample shows the processing of approximately 398,573 messages through the queuing model (taken as a sample from one of two processing nodes). The host machine supporting the 7 software components (one NOE and two each of ALC, NPR and SET software modules) reached an average 93% utilization during the monitoring interval (U) according to the Windows performance monitoring tools. For the throughput calculation it must be borne in mind that there were two processing components running for each of the ALC, NPR and SET message types.

The processing network therefore achieved a mean throughput rate of approximately 1,162 messages per time unit during the test with a latency time of approximately 14.97 seconds. Latency, in this case, refers to the time difference between input and output message processing. A message entering the network will appear on the output side approximately 14.97 seconds later at the measured processing levels.


Figure 15.   Measured Performance Curve

The sample rate was set to a 7 second interval (owing to the use of an un-calibrated tick interval) and the sample data set was 500,000 messages (two messages drivers each with 250,000 messages from the standard test data set). In this case there were two process engines (or host machines) and the results shown were taken from one of those processing engines (note that the results for the ALC, NPR and SET components are aggregated across two components since the original data was measured from the message queue from which both components were being fed). The marginally higher service time (S) for the NOE messages reflects the higher processing level required of this component because of audit trail persistence and validation processing.

The performance curve for each of the processing components is shown in Figure 15 with the overall performance point marked for clarity. This is fundamentally the generalized Little's Law curve for the STE processing engine; however, this generalized view does not give all of the detail necessary to accurately predict the operating performance of the engine. It is apparent that, for the individual components there are different completion rates and therefore depending on which view of the performance metrics you wish to take there will be a different corresponding performance values. This gives rise to the Performance Operating Region (POR) for the network as shown in the shaded area of the graph in Figure 16. In this particular instance the results are reasonably close and therefore the corresponding performance operating region is narrow.

This is not always the case, however, and there are examples of systems where the POR covers a region exceeding 400% of the mean performance level. The prediction of the POR requires some complicated mathematics that were beyond the scope of this benchmark exercise; however, the effect of the POR is included here to explain the variation in the measured results during repetitive tests.


Figure 16.   The Measured Little's Law Curve

Comments and Conclusions

The impressive results of the benchmark largely speak for themselves with the overall performance, scalability and flexibility well established. The throughput rate for the overall engine certainly places it amongst some of the largest transaction engines for post trade processing infrastructure in the financial services industry. The target and stretch performance levels were exceeded with comfortable margins, and there are strong indications that the overall architectural approach will support even greater message throughputs. It is certainly worth stating that, with the achieved performance levels, current Microsoft technology offerings are capable of operating within the enterprise layer of any financial institution. Some aspects of operation within the enterprise layer, like resilience and reliability, were not tested during this benchmark and remain to be proven for this design. However, networked or grid computing-based architectures like this one have inherent characteristics to support extremely high levels of resilience and reliability. The use, therefore, of efficient grid-based processing machines and low-cost software technology would seem to be a winning combination.

Low Cost and Efficient Realization

Probably one of the more significant results of the entire benchmark process is the now proven ability of Microsoft technology to perform and scale to enterprise levels. The processing rates achieved with the queuing architecture certainly place the Microsoft operating system in the upper band for capability and scalability in networked computing. The second most significant result of the testing was the relatively low cost of implementation of the system in the Microsoft environment.

Potential Improvement Areas

In addition to the monitoring tools used to detail the benchmark results, Microsoft was able to provide additional process monitoring tools (which will be available in Visual Studio 2005) that gave a detailed view of the execution of the software elements. The Microsoft analysis tools indicated that the software components were spending on average 30%-35% of their time performing functions related to data extraction from the XML messages. This was not an overly surprising result since the main function of the software components was to validate and process string types. To access required information the C/C++ strstr search function was used and we treated the entire message as one complex string. (Note: for our problem domain this was faster than directly using an XML parser and DOM objects with XSLT.)

Although in general circumstances the use of strstr produces adequate performance levels, there are more efficient techniques that can be employed to extract information from string-based message structures. R.S. Boyer and J.S. Moore [BYMR77] described a very efficient mechanism for searching well-structured strings. The algorithm works well where the structure of the string is known in advance and is used predominantly in applications where searching large text strings is required, such as in digital library applications, editors or word processors.

For the queuing network the use of the algorithm would at first sight seem inappropriate since we have no way of determining the nature or structure of the next received message within the network. However, for the processing components the structure of the message is known since we route the message according to type for further processing. The use of the Boyer-Moore algorithm could yield improved results over the existing implementation of the network; however, the relatively small size of the XML messages (an average of 1,500 bytes per message) might be too small for the Boyer-Moore algorithm to yield results that would justify the work required to implement the algorithm.

Itanium and 64-Bit Processing

The operation of the queuing network model was tested using a beta version of SQL Server running on new (at least it was then) Itanium 64-bit hardware and the Windows operating system. Although this was not acceptable as a production benchmarking environment (because of using beta software) the results would be a useful indicator of future performance gains that could be obtained using this future Intel/Microsoft technology. On this hardware the measured throughput was an average of 872 messages per second which was considered extremely high considering the environment in which the test took place. Firstly, this result was obtained using a standard SCSI disk unit as opposed to the RAID arrays used in the main benchmark exercise. Standard SCSI performance rates would have been considerably slower than the RAID performance rates. Secondly, the Itanium database server had only two processors installed against the 8 processors used for the database engines in the benchmark. The opportunity to perform a full benchmark test within a 64-bit environment is eagerly awaited.

C# and Managed Code

The software components were also generated for use in the Microsoft C# Managed Code environment where a direct comparison could be made between the C/C++ and C# versions. As a simple test the operation of the message drivers was compared between the operating models. The process involved was fairly simple so that the effects of inefficient coding could be ignored (the actual number of active lines of code was very small). The process was to take a prepared message file and stream the data into a code loop. Processing would continue until a message separation character was received.

The resultant message was then wrapped in a standard GSTPA header and a digital signature applied to the message block. The message was then written to the message queue for the queuing network model for processing. This process continued until the entire prepared file was read. The parameter of interest to us was the throughput rate at which message could be read off the data file and queued. The results of the test are shown in Figure 17. [These results clearly show the enhanced performance of C/C++ over the managed code environment (.NET Framework 1.1).

It is also fair to point out that the results also include the comparison of efficiency of the interoperability layer between C/C++ and C# which is crossed for accessing MSMQ. At first sight it could be argued that from a performance perspective, the managed code environment should never be implemented in place of a C/C++ installation. This view, however, would be misleading since all system solutions are a compromise between cost, performance and reliability.

The overall performance results for the managed code environment reflected the test performed on the (simple) message driver component producing a throughput rate of approximately 2000 messages per second. Although this rate is around 25% of the base C/C++ level, there are definitely compensating factors that must be considered. The production of C# code is significantly more efficient than that obtainable with C/C++ code; in fact the rate at which operational code could be developed in C# was extremely impressive.

Note, a lower rate of 2000 messages per second (which is 7,200,000 messages per hour or 57,600,000 messages per day) is still considered in the upper bracket of transaction engine benchmarks and it will only get better as managed code gets faster!


Figure 17.   Comparison of Managed and Unmanaged Code

Furthermore, care must be taken when comparing managed and unmanaged code environments. The use of virtual machine environments like the Common Language Runtime (CLR) used in the .NET Framework and indeed Java/ J2EE VM-based environments can produce benchmark figures comparable with that of C/C++ code where memory-based operation are concerned. Unfortunately, such benchmark comparisons give a false picture of the overall performance levels that can be expected because most applications include elements of local and remote I/O operations and dynamic object creation and deletion.

Conversely, the ease of implementation and potential improvements in reliability and manageability may well allow managed code environment to out-perform C/C++ in the creation of applications sacrificing performance for lower costs and faster implementation times. Given that we have significant unrealizable performance (see the above discussion on this topic) with our unmanaged code implementation of the STE system —because the architecture scaled-up so remarkably well, we actually find ourselves in an opportune position to trade-off this unrealizable performance for the benefits of using managed code in future implementations of this generalized architecture. For some insights into this statement see [SEVA04].


  • [BDEN78] Buzen J.P. and Denning P.J., "The Operational Analysis of Queuing Network Models", ACM Computing surveys, 10, 3, Sept. 1978, 225-261.
  • [BYMR77] Boyer R.S, Moore J.S, 1977, "A fast string searching algorithm". Communications of the ACM. 20:762-772.
  • [LITTL61] Little J.D.C, "A Proof of the Queuing Formula L = W", Operations Research, 9, 1961.
  • [PHEL02] Pat Helland, 2002, "Autonomous computing: Fiefdoms and Emissaries", Microsoft Webcast, ondemand/892.asp
  • [SMITH90] Connie U. Smith, "Performance Engineering of Software Systems", Addison-Wesley, ISBN 0-201- 53769-9, 1990.
  • [SEVA04] Arvindra Sehmi and Clemens Vasters, FABRIQ GotDotNet Workspace:
  • [SMWIL02] Connie U. Smith and Lloyd Williams, "Performance Solutions: A Practical Guide to Creating Responsive, Scalable Software", Addison-Wesley, ISBN 0-201-72229-1, 2002.
  • [UASDS01] Susan D. Urban, Akash Saxena, Suzanne W. Dietrich, and Amy Sundermier, "An Evaluation of Distributed Computing Options for a Rule-Based Approach to Black-Box Software Component Integration", Proceedings of the Third Int'l Workshop Advanced Issues of E-Commerce and Web-Based Information Systems (WECWIS'01), 2001.
  • [ZHES01] Huaxin Zhang and Eleni Stroulia, "Babel: Representing Business Rules in XML for Application Integration", Proceedings of the 23rd International Conference on Software Engineering (ICSE'01), 2001.

About the authors

Richard Drayton
FiS Group

Richard Drayton has been actively involved in technology architecture and design for 30 years. Recognized for his work in the field of software performance engineering, he is a practicing member of the ACM's SIGMETRICS group. For the past 15 years Richard has been active in the design and development of Financial Trading and Processing systems for many of the world's leading investment banks. Previously Richard was the head of Front Office Technology for Deutsche Bank in Frankfurt, Global Head of Architecture for Commerz Financial Products and Global Head of Architecture for Dresdner Bank. As part of his work with the IEEE and ACM, he has focused on the adaptation of high-performance computing solutions to the financial services industry with particular focus on the implementation of end-to-end STP market infrastructure solutions. He is the Executive Director responsible for technology solutions with the Financial Infrastructure Solutions Group. He holds an M.Sc. in Electronic System Design as well as degrees in Pure Mathematics and Communications Technology.

Arvindra Sehmi
Microsoft EMEA

Arvindra Sehmi is an Architect in Microsoft EMEA Developer and Platform Evangelism Group. He focuses on enterprise software-engineering best practice adoption throughout the EMEA developer and architect community and leads Architecture Evangelism in EMEA for the Financial Services Industry where his current interest is in high performance asynchronous computing architectures. Arvindra is the executive editor of JOURNAL. He holds a PhD in Bio-medical Engineering and a Masters degree in Business.


1 Pat Helland's "Autonomous computing: Fiefdoms and Emissaries" [PHEL02] webcast gives more details on the autonomous computing model.

2 The GSTPA is now defunct but this unfortunate event has no impact on the substance of this study. Additionally, the original proposal championed by the GSTPA could not be used directly in the Singapore market, so specific processing schemes were created for the SGX processing engine.

3 The individual message size is limited to 4 MB in both MSMQ 3.0 and MSMQ 2.0, but this was not a issue in our scenario.

4 S denotes the server number (base 0) in the range 0-(N-1); K denotes the very large integer value determined by the key for the message and N denotes the proposed number of database servers (clearly N must be greater than 0 and less than K). The calculation of the modulus function may seem complex; however, by choosing a binary multiple for the divisor the calculation reduces to right shifting the value of K by N bit position for N > 0. For example if N were set to 3 then the divisor would be 8 and shifting the value of K right by 3 bit positions would accomplish the required division. The remainder on division by 8 is therefore simply the value of the least significant 3 bits of K (an integer in the range 0 through 7).

This article was published in the Architecture Journal, a print and online publication produced by Microsoft. For more articles from this publication, please visit the Architecture Journal website.