The following sections give a summary of the performance test results, which includes highlights on throughput and latency, BizTalk Server performance, database and SQL Server performance, and network performance.
Throughput and Latency
The results show that in each configuration, decreasing the throughput (by decreasing the number of virtual users sending messages to BizTalk Server) improves latency. This relationship can be clearly shown from the results of test case 6. In that test case, 5, 7, and 8 virtual users were tested. The following figure shows the results:
Figure 24 Virtual users vs. throughput and latency
The relationship between throughput and latency is due to resource contention, mainly in the MessageBox database. For a given configuration with fixed resource limits (especially CPU speed and disk I/O speed), more throughput requires faster processing to keep the same message latency. If the CPU of the MessageBox database computer reaches its limit (or even just high utilization, for example, greater than 60 percent), or if the SAN I/O starts to degrade (because it receives more I/O requests than it can handle and requests start to queue), it takes more time to process and send messages, and latency increases.
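The effect of utilization on latency can be illustrated with a simple queueing model. The sketch below uses the textbook M/M/1 formula for mean time in system; this model and the capacity of 150 messages per second are assumptions chosen for illustration, not measurements or a model taken from the tests.

```python
def mm1_latency_ms(arrival_rate: float, service_rate: float) -> float:
    """Mean time in system for an M/M/1 queue, in milliseconds.

    arrival_rate and service_rate are in messages per second.
    """
    if arrival_rate >= service_rate:
        raise ValueError("arrival rate must be below the service rate")
    return 1000.0 / (service_rate - arrival_rate)


# Hypothetical capacity: the contended resource serves 150 messages per second.
SERVICE_RATE = 150.0

for load in (30.0, 75.0, 120.0, 140.0):
    utilization = load / SERVICE_RATE
    print(f"{load:5.0f} msg/s  utilization {utilization:4.0%}  "
          f"latency {mm1_latency_ms(load, SERVICE_RATE):6.1f} ms")
```

In this model, latency grows slowly at low utilization and sharply as utilization approaches 100 percent, which matches the qualitative behavior described above.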
Computers Running BizTalk Server
To prevent resource contention, isolate the tracking service and transport adapters onto separate hosts.
Although the computers running BizTalk Server must be fast enough to receive and send messages, the tests show that scaling out (adding as many computers as needed) was more important than scaling up the receiving and sending servers (using computers with multiple fast CPUs). This can be explained as an effect of parallelism.
The best results were achieved with 7 computers running BizTalk Server (2 configured for receiving and sending messages and 5 configured for sending messages only).
The other important factor was the balance between the receiving servers and the sending servers. In BizTalk Server 2004, the receipt of messages is faster and less expensive than processing and sending messages.
Tests showed that with more than 2 computers running BizTalk Server receiving and fewer than 7 computers running BizTalk Server processing and sending, a backlog of messages waiting to be processed started to accumulate, causing a significant increase in message latency. In extreme cases, the message queue in the BizTalk Message Box database increased significantly, causing processing delays of 20 seconds or more.
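The backlog effect can be sketched with simple rate arithmetic: when messages are received faster than they are processed and sent, the queue grows linearly, and each new message waits behind it. The rates below are hypothetical values chosen for illustration, not figures from the tests.

```python
def backlog_after(seconds: float, receive_rate: float, send_rate: float,
                  initial_backlog: float = 0.0) -> float:
    """Messages waiting in the queue after `seconds`, never below zero."""
    growth = (receive_rate - send_rate) * seconds
    return max(0.0, initial_backlog + growth)


def queueing_delay_s(backlog: float, send_rate: float) -> float:
    """Time a newly received message waits behind the current backlog."""
    return backlog / send_rate


# Hypothetical rates: receiving hosts accept 100 msg/s, sending hosts drain 80 msg/s.
backlog = backlog_after(seconds=60, receive_rate=100, send_rate=80)
print(f"backlog after 60 s: {backlog:.0f} messages")
print(f"delay for a new message: {queueing_delay_s(backlog, 80):.0f} s")
```

With these assumed rates, a one-minute imbalance is enough to produce a delay of the same order as the 20-second delays reported above.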
The following figure shows how throughput and latency improved by increasing the number of sending servers from 4 to 7 with 2 receiving servers.
Figure 25 Receiving and sending vs. throughput and latency
Monitoring the computers running BizTalk Server during testing revealed that the CPU utilization of these computers was low. On the 4 CPU computers, the average CPU utilization was around 5 percent, while the 2 CPU computers were between 30 percent and 40 percent, especially when they were used to host the receiving and sending hosts.
Although 4-CPU servers were used in the tests, computers with 2 fast CPUs are sufficient to run BizTalk Server in this scenario. The 4-CPU servers were underutilized, even during tests with high throughput and low latency.
Database and SQL Server Performance
SQL Server performance is the most important factor for the overall performance of BizTalk Server 2004. When using BizTalk Server 2004 for messaging only (that is, where no other features such as orchestration, Human Workflow, or Business Rules Engine are used), the only BizTalk Server databases used are:
- BizTalkMgmtDb: Used to store the configuration settings for the BizTalk group
- SSODB: Used to store the sensitive configuration settings encrypted for the BizTalk group
- BizTalkDTADb: Used to store the tracking data for the BizTalk group
- BizTalkMsgBoxDb: Used to store the received messages until they are processed and sent. It contains a spool table where messages are first stored when they are received before they are processed and moved into the right application queue tables. It also stores the state of the system and message subscriptions data.
Among these databases, BizTalkMsgBoxDb is the busiest at run time, so the performance of the computer running SQL Server that hosts it is critical to the overall performance of the BizTalk group. Each BizTalk group has at least one instance of this database and more instances can be added to scale it out.
In this testing, one computer was dedicated to the MessageBox database and another computer was dedicated to the other BizTalk Server databases. A third computer running SQL Server was used for custom functionality in this scenario, such as error logging from the custom pipeline components and the audit trail.
For the MessageBox database server, the most powerful server computer should be used to achieve the best performance. In this performance tuning testing, the following computers were tested for the MessageBox database:
- Intel Xeon 8-way (HT) 3.0 GHz 12 GB RAM running the 32-bit version of the Microsoft Windows Server™ 2003 operating system and the 32-bit version of SQL Server 2000 with Service Pack 3a (SP3a)
- AMD Opteron 4-way single-core, 2.4 GHz 16 GB RAM running the 64-bit Windows Server 2003 with SP1 and the 32-bit version of SQL Server 2000 with SP4
- AMD Opteron 4-way dual-core, 2.2 GHz 16 GB RAM running the 64-bit version of Windows Server 2003 with SP1 and the 32-bit version of SQL Server 2000 with SP4
For the tracking database and other databases:
- Intel Xeon 8-way (HT) 3.0 GHz 12 GB RAM running the 32-bit version of Windows Server 2003 and the 32-bit version of SQL Server 2000 with SP3a
The SQL network connectivity used was the default TCP/IP. The SQL memory setting for the Xeon computer was also the default setting (which is up to 2 GB) and for the Opteron computer it was fixed to 4 GB.
Disk performance (I/O speed and volume) for those computers was critical. SAN storage was used, and it was ultimately the bottleneck. Better performance would have been possible if the SAN had been able to perform more I/O faster.
32-bit 8-Way Xeon versus 64-bit 4-Way Opteron
The following tables compare MsgBox database server performance between the 32-bit 8-Way Xeon and the 64-bit 4-Way Opteron computers from test cases 8 and 9.
Table 86 Throughput and Latency
| Config | Throughput* (Mean) | Throughput* (Median) | Request Time (Mean) | Request Time (Median) | Response Time (Mean) | Response Time (Median) | Roundtrip Time (Mean) | Roundtrip Time (Median) | # of roundtrips |
|---|---|---|---|---|---|---|---|---|---|
| 8-A | 71 | 76 | 252 | 243 | 398 | 392 | 642 | 644 | 140111 |
| 9-A | 128 | 123 | 979 | 329 | 1045 | 580 | 1912 | 905 | 245852 |
*Throughput is in messages per second. All times are in milliseconds.
Table 87 Percentage of CPU
| Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02 |
|---|---|---|---|---|---|---|---|---|---|
| 8-A | 30 | 19 | 21 | 38 | 16 | 25 | 47 | 17 | 0.07 |
| 9-A | 45 | 45 | 47 | 22 | 16 | 73 | 73 | 20 | 0.02 |
Table 88 Average memory used (MB)
| Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02 |
|---|---|---|---|---|---|---|---|---|---|
| 8-A | 350 | 310 | 285 | 438 | 283 | 324 | 1987 | 2017 | 351 |
| 9-A | 570 | 515 | 542 | 446 | 325 | 513 | 3508 | 2049 | 260 |
The results above show the following:
- Moving from the 32-bit 8-Way Xeon to the 64-bit 4-Way Opteron increased throughput from 71 msg/s to 128 msg/s (with the same number of virtual users) and latency increased from 642 ms to 1912 ms, if you compare the mean roundtrip times, and from 644 ms to 905 ms, if you compare the median roundtrip times.
- CPU utilization increased from 47 percent for the 32-bit 8-Way Xeon to 73 percent for the 64-bit 4-Way Opteron.
- Memory used by the 32-bit 8-Way Xeon was approximately 2 GB while the 64-bit 4-Way Opteron used approximately 3.5 GB.
These results show that higher throughput came with higher latency. The 64-bit 4-Way Opteron allowed more messages to be received, processed, and inserted into the system, so the same number of virtual users inserted more work. This created more contention on the MessageBox database, which caused higher latency.
| Note |
|---|
| The physical memory used by the 32-bit 8-Way Xeon SQL Server was the default maximum value for applications on a 32-bit Microsoft Windows® operating system, which is 2 GB. For the 64-bit 4-Way Opteron SQL Server, the behavior of SQL memory management was changed to be fixed at 4 GB. |
Number of Message Boxes
As the most contended resource within the architecture, the single BizTalk MessageBox database is an obvious performance bottleneck.
Each BizTalk group has at least one instance of this database, and more instances can be added to scale it out. This scalability feature has two important factors to consider:
- One instance of the MessageBox database (called the master MessageBox) is always used, even if the messages are stored in and retrieved from the other, non-master MessageBox databases. This is because subscription processing has to be done in the master database.
- When additional MessageBox databases are used to distribute the load of storing and retrieving messages, the master MessageBox database still has to be used for subscription processing. Each message is therefore processed within a transaction that spans multiple databases (the master MessageBox database and the additional MessageBox database used for storing and retrieving the message). This distributed transaction between the database servers is coordinated by the Distributed Transaction Coordinator (MSDTC) service, which adds performance overhead to the message processing transaction. This overhead is higher when the additional MessageBox databases are physically on a different computer than the master MessageBox database.
It is also important to note that although other benchmarks have shown that a multiple MessageBox configuration allows more throughput, the aim in this case was low latency, which limited the ability to take advantage of the larger bandwidth that multiple MessageBox databases provide.
In test case 19, multiple MessageBox configurations were tested. The results of configuration A with 3 MessageBox databases show that, compared to test case 18 configuration D with a single MessageBox database, throughput decreased from 77 msg/sec to 74 msg/sec and latency increased from 288 ms to 302 ms. This means that the gain (if there was any) of splitting the load across 3 MessageBox databases was offset by the extra overhead of the DTC service (which is needed when using multiple MessageBox databases). However, although performance did not improve, this multiple MessageBox configuration produced more consistent results (fewer spikes in latency) during the test. This can be seen by comparing the mean and median latency numbers.
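One simple way to read consistency from mean and median figures is their ratio: occasional large spikes pull the mean up far more than the median. The following minimal sketch shows the idea. The sample lists are hypothetical, and `spike_ratio` is an illustrative helper rather than a metric used in the tests; the last loop applies the same ratio to the mean and median roundtrip times reported in Table 86.

```python
from statistics import mean, median


def spike_ratio(latencies_ms):
    """Mean divided by median; values well above 1 suggest latency spikes."""
    return mean(latencies_ms) / median(latencies_ms)


# Hypothetical latency samples in milliseconds (not taken from the tests).
steady = [300, 305, 298, 302, 301, 299]
spiky = [300, 305, 298, 302, 301, 2500]  # same run with one large spike

print(f"steady run: ratio {spike_ratio(steady):.2f}")
print(f"spiky run:  ratio {spike_ratio(spiky):.2f}")

# The same ratio from reported summary figures (mean, median roundtrip, Table 86).
for config, (mean_ms, median_ms) in {"8-A": (642, 644), "9-A": (1912, 905)}.items():
    print(f"{config}: ratio {mean_ms / median_ms:.2f}")
```

A ratio close to 1 indicates that mean and median agree, which is what a run with few spikes looks like.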
In configuration D with 4 MessageBox databases on different computers running SQL Server, the results show a little improvement in throughput with similar latency.
In configuration E with 5 MessageBox databases on different computers running SQL Server, the results show slightly more improvement in throughput and latency than configuration D. Even so, the single MessageBox in test case 18 configuration D (77 msg/sec throughput and 288 ms latency) still performs better than this configuration (68 msg/sec throughput and 304 ms latency).
The following figure shows the results of MessageBox databases vs. throughput and latency:
Figure 26 MessageBox vs. throughput and latency
To assess any performance benefit that could be derived from the dual-core Opteron processor (which has more CPU headroom than the single-core CPU), a configuration with 3 MessageBox databases was tested in test case 21. Additionally, each MessageBox database was serviced by a separate HBA card to spread out the load on the SAN I/O. The results of this test case configuration C (82 msg/sec throughput and 261 ms latency) show some improvement compared to test case 14 configuration C (75 msg/sec throughput and 251 ms latency) with a single MessageBox.
In configuration D, the number of virtual users was increased. The results show more throughput and slightly higher latency than configuration C (98 msg/sec throughput and 288 ms latency, compared to 82 msg/sec throughput and 261 ms latency for configuration C). This can be considered better performance than test case 14 configuration C (75 msg/sec throughput and 251 ms latency) with a single MessageBox, if the extent of the improvement in throughput is compared to the degradation in latency.
Note also that during the test of this configuration, high service times were observed on the SAN I/O, which capped the performance.
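The comparison of throughput improvement against latency degradation can be made explicit by expressing both as a fraction of the single MessageBox baseline. The sketch below uses only the figures quoted above; the `Result` class and `relative_change` helper are illustrative names, not part of any test tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    label: str
    throughput: float   # msg/sec
    latency_ms: float   # ms


def relative_change(baseline: Result, candidate: Result) -> tuple[float, float]:
    """Return (throughput change, latency change) as fractions of the baseline."""
    throughput_change = (candidate.throughput - baseline.throughput) / baseline.throughput
    latency_change = (candidate.latency_ms - baseline.latency_ms) / baseline.latency_ms
    return throughput_change, latency_change


# Figures quoted in the text above.
BASELINE = Result("test case 14 configuration C (single MessageBox)", 75, 251)
CANDIDATES = [
    Result("configuration C", 82, 261),
    Result("configuration D", 98, 288),
]

for candidate in CANDIDATES:
    t, l = relative_change(BASELINE, candidate)
    print(f"{candidate.label}: throughput {t:+.1%}, latency {l:+.1%}")
```

Against the baseline, configuration D gains about 31 percent in throughput for about 15 percent more latency, which is the sense in which it can be considered an improvement.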
Message Body Tracking
The results of the test cases 14 and 15 show the impact of message body tracking on performance.
- Test Case 14 – Configuration A: Message body tracking enabled at 4 points (the most expensive as explained in the Tracking section under the Test Description at the beginning of this document).
- Test Case 14 – Configuration B: Message body tracking disabled.
- Test Case 15 – Configuration A: Message body tracking enabled at 1 point only.
The following tables show the results of three comparable configurations:
Table 89 Throughput and Latency
| Config | Throughput* (Mean) | Throughput* (Median) | Request Time (Mean) | Request Time (Median) | Response Time (Mean) | Response Time (Median) | Roundtrip Time (Mean) | Roundtrip Time (Median) | # of roundtrips |
|---|---|---|---|---|---|---|---|---|---|
| 14-A | 79 | 83 | 104 | 100 | 211 | 210 | 317 | 312 | 152218 |
| 14-B | 92 | 98 | 92 | 89 | 209 | 210 | 301 | 303 | 177730 |
| 15-A | 87 | 92 | 99 | 100 | 226 | 230 | 333 | 227 | 166629 |
*Throughput is in messages per second. All times are in milliseconds.
Table 90 Percentage of CPU
| Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02 |
|---|---|---|---|---|---|---|---|---|---|
| 14-A | 15 | 13 | 10 | 41 | 32 | 15 | 56 | 12 | 0.03 |
| 14-B | 18 | 13 | 12 | n/a | 44 | 20 | 46 | 14 | 0.04 |
| 15-A | 18 | 10 | 14 | n/a | 29 | 20 | 52 | 13 | 0.06 |
Table 91 Average memory used (MB)
| Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02 |
|---|---|---|---|---|---|---|---|---|---|
| 14-A | 491 | 372 | 373 | 511 | 426 | 394 | 4097 | 2059 | 261 |
| 14-B | 480 | 376 | 375 | n/a | 446 | 396 | 4157 | 2059 | 261 |
| 15-A | 475 | 383 | 373 | n/a | 435 | 396 | 4072 | 2059 | 263 |
The results of 14-B show higher throughput and lower latency than 14-A. This is because message body tracking was enabled in 14-A and disabled in 14-B.
The results of 15-A show higher throughput than 14-A, and also higher latency. This is because message body tracking in 15-A was less expensive than in 14-A, which allowed throughput to increase, and the higher throughput in turn increased latency.
As expected, comparing 15-A (message body tracking at 1 point) to 14-B (no message body tracking) shows that message body tracking at 1 point still has a negative impact on performance.
It is also worth noting that under lighter load the overhead of message body tracking is lower, although it is still not negligible.
The following figure shows the results of message body tracking versus throughput and latency:
Figure 27 Message Body tracking versus throughput and latency
Disk Input/Output and SAN Performance
The SAN used in the testing was from 3PAR. The 3PAR SAN was configured with 134 x 10K RPM disks. 3PAR employs 3-level virtualization based upon a mapping methodology.
The first level of mapping virtualizes physical disk drives of any capacity into a pool of uniform-sized "chunklets" (256 MB each). These fine-grained chunklets eliminate underutilization of storage assets by permitting volumes to be sized precisely and not according to large arbitrary increments. Complete system access to every chunklet eliminates large pockets of inaccessible storage. Performance is enhanced, even for small volumes, since the underlying chunklets are distributed across scores, or even hundreds, of disks.
The second level of mapping associates chunklets with Logical Disks (LDs). Logical Disks are intelligent compilations of chunklets based on RAID characteristics and the location of chunklets across the system. LDs are tailored to meet precise cost, performance, and availability characteristics. The first and second level mappings result in a massive parallelism of workloads across disks, Fibre Channel loops, and Controller Nodes. This load balancing occurs simply and automatically, eliminating the need for array planning or disk management.
The third level of mapping associates Virtual Volumes (VVs) with all or portions of an underlying LD or multiple LDs. VVs are the virtual capacity representations that are ultimately exported to hosts and applications. A VV can be coherently exported through as many or as few 3PAR InServ Storage Server ports as desired.
The following figure shows 3PAR storage virtualization:
Figure 28 3PAR storage virtualization
The cabinet is configured with 2 controllers, each having 8 GB RAM. Each volume was configured as RAID 1. This means that the disks are mirrored, which allowed the 3PAR to do the striping. The servers were connected by a Brocade 3800 2 Gb Fibre Channel switch. The servers used Emulex HBA cards to connect to the SAN.
3PAR provided monitoring tools to monitor the SAN behavior during testing, in particular the number of I/O operations performed on the SAN, I/O sizes, queue length, and service times (the length of time in milliseconds that the SAN takes to complete I/O transactions), in second intervals.
Using these monitoring tools, the following was observed:
- Although the SAN I/O figures indicated that the SAN performance numbers did not reach the maximum values, meaning it was not saturated yet, there were correlations between the SAN I/O service time spikes and immediate effects on message latency.
- These effects were especially pronounced during SQL checkpointing events under load where SQL Server synchronizes the transaction log and the data file. Therefore, during such checkpointing events, a significant effect on SAN utilization was observed, and as a knock-on effect, SQL Server was not able to process messages as quickly. This resulted in a higher latency.
These observations indicated that SAN I/O was the bottleneck: all other resources had enough headroom, yet latency increased when throughput increased, and the SQL checkpointing events clearly caused spikes in message latency.
To reduce or minimize the effect of SQL checkpointing (which produces high I/O for a short period every 1 minute or so) the SAN I/O performance should be improved such that the SQL performance could "ride out" the checkpointing events without adversely affecting the message latencies.
Switching off SQL checkpointing is not an option, and varying the SQL checkpointing interval did not provide an improvement, for the following reasons:
- Making checkpointing less frequent caused more work for SQL to do at checkpoint events, resulting in even larger spikes and larger impacts on the SAN I/O.
- Making checkpointing more frequent caused SQL to be checkpointing more often, which resulted in multiple smaller spikes, each affecting latency adversely although to a lesser amount.
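The trade-off between checkpoint frequency and spike size can be sketched with a toy model in which modified data accumulates at a constant rate and is flushed in full at each checkpoint. This model and the rate of 2 MB per second are simplifying assumptions for illustration only; they do not describe how SQL Server actually schedules checkpoint writes.

```python
def checkpoint_profile(dirty_mb_per_s: float, interval_s: float,
                       window_s: float = 3600.0):
    """Return (number of checkpoints, MB flushed per checkpoint) in a window.

    Toy model: dirty data accumulates at a constant rate and is flushed
    in full at each checkpoint.
    """
    checkpoints = int(window_s // interval_s)
    burst_mb = dirty_mb_per_s * interval_s
    return checkpoints, burst_mb


# Hypothetical write rate of 2 MB/s; intervals in seconds.
for interval in (30, 60, 120):
    count, burst = checkpoint_profile(dirty_mb_per_s=2.0, interval_s=interval)
    print(f"interval {interval:3d} s: {count:3d} checkpoints per hour, "
          f"{burst:5.0f} MB per checkpoint, {count * burst:.0f} MB total")
```

In this model the total amount written per hour does not change with the interval. Only the number and size of the bursts change, which is consistent with the observation that neither direction of tuning removed the effect on latency.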
As an indication of the SAN performance and the I/O requirements, the following table shows the numbers recorded (using the SAN tools) for the virtual volume that had the MessageBox database in the tests that were bound by the SAN I/O:
Table 92 Input/Output per second
| File | I/O per sec (Cur) | I/O per sec (Avg) | I/O per sec (Max) | Kbytes per sec (Cur) | Kbytes per sec (Avg) | Kbytes per sec (Max) | Svt ms (Cur) | Svt ms (Avg) | IOSz KB (Cur) | IOSz KB (Avg) |
|---|---|---|---|---|---|---|---|---|---|---|
| Data File* | 5320 | 338 | 5320 | 122830 | 7797 | 122830 | 3.4 | 3.4 | 23.1 | 23.1 |
| Log File | 999 | 745 | 1051 | 21296 | 10299 | 33934 | 0.9 | 0.8 | 21.5 | 13.7 |
The numbers above show that the service time for the data file I/O during SQL checkpoints was 3.4 ms, which is very high. The service time for I/O should be in microseconds. Even for the log file I/O, the service time of 0.9 ms is high.
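The figures in Table 92 can be cross-checked with two simple relations: bandwidth is approximately the I/O rate multiplied by the average I/O size, and, by Little's law, the average number of I/Os in flight is the I/O rate multiplied by the service time. The sketch below applies both to the "Cur" columns. Treating the reported service time as the full time each I/O spends in the SAN is an assumption, and the function names are illustrative.

```python
def bandwidth_kb_per_s(iops: float, io_size_kb: float) -> float:
    """Bandwidth implied by an I/O rate and an average I/O size."""
    return iops * io_size_kb


def outstanding_ios(iops: float, service_time_ms: float) -> float:
    """Little's law: mean I/Os in flight = I/O rate x mean service time."""
    return iops * service_time_ms / 1000.0


# "Cur" values from Table 92: (I/O per sec, IOSz KB, Svt ms, reported Kbytes per sec)
TABLE_92_CUR = {
    "data file": (5320, 23.1, 3.4, 122830),
    "log file": (999, 21.5, 0.9, 21296),
}

for name, (iops, size_kb, svt_ms, reported_kb) in TABLE_92_CUR.items():
    implied = bandwidth_kb_per_s(iops, size_kb)
    in_flight = outstanding_ios(iops, svt_ms)
    print(f"{name}: implied {implied:.0f} KB/s vs reported {reported_kb} KB/s, "
          f"about {in_flight:.1f} I/Os in flight")
```

The implied bandwidth agrees with the reported values to within about 1 percent. Under the stated assumption, the data file volume has roughly 18 I/Os in flight during a checkpoint, compared to fewer than 1 for the log file.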
It is important to note that different SAN architectures require different configurations. For the type of workload characteristic that BizTalk Server uses, it is best to have LUNs with as many fast spindles (physical disks) as possible.
Some SAN architectures allocate large spaces on fewer drives, meaning that for a given LUN there are fewer disks. For these architectures, more disk space should be allocated than is needed to ensure that more spindles are used.
Network Performance
Network connections with 1 Gbps of bandwidth were provided between all computers in the configuration, with the exception of the connection from the test harness computer to the computers running BizTalk Server, which was a 100 Mbps network connection.
The test harness computer (used as the load generator for LoadRunner as well as for the test harness Web applications) utilized, on average, less than 20 percent of the available 100 Mbps bandwidth.
However, the SQL Server computer running the MessageBox database had a 1 Gbps connection, and on average 120 Mbps of the bandwidth was used under load, with no discernible increase noted. The best explanation for this is that even though the network bandwidth had more headroom, the SAN I/O limited the SQL Server performance.
The network cards used were the standard Broadcom 10/100/1000 NICs that come on all HP servers. For more information, see the Hewlett Packard Web site.