
BizTalk Server 2004: Performance Tuning for Low Latency Messaging

Alaeddin Mohammed, Kevin Lam

Microsoft Corporation

August 2005

Applies to: Microsoft BizTalk Server 2004

Summary: This document describes performance tuning for low latency messaging. It explains how to achieve the lowest latency possible for a messaging-only scenario using the HTTP adapter for sending and receiving. It provides details about the software and hardware configurations. (64 printed pages)

This document provides details of the software and hardware configurations used to achieve the required performance. It describes how the different factors affect performance.

This document does not describe how individual features, components, or configurations impact the overall performance of other specific deployments or scenarios. This document is intended to be a descriptive guide only; it does not provide prescriptive information or recommendations for optimizing a particular BizTalk Server 2004 deployment or scenario.

The information in this document is based on an extensive performance tuning exercise in the BizTalk Server product group labs in Redmond. The objective was to achieve an average of 300 milliseconds (ms) latency with the highest possible message throughput using one BizTalk group.

The following figure shows the message flow in this scenario.

Figure 1 Message flow throughput


The scenario is as follows:

  1. Application A sends request messages to BizTalk Server 1-way HTTP receive port locations.
  2. BizTalk Server routes the request messages to a 2-way (solicit-response) HTTP Send Port.
  3. The 2-way HTTP Send Port sends the request messages to another Application B.
  4. Application B processes the request and returns response messages synchronously to BizTalk Server.
  5. BizTalk Server routes the response messages to a 1-way HTTP Send Port.
  6. The 1-way HTTP Send Port sends the response messages back to the requesting application A.

This document is intended for anyone who uses, or plans to use, BizTalk Server 2004. Specifically, this document is aimed at technical professionals who design, develop, or deploy applications and solutions based on BizTalk Server 2004. These professionals include:

  • Developers
  • Business users
  • Application designers
  • Technical sales staff and consultants
  • Systems integrators and analysts
  • Network engineers and technicians
  • Information technology (IT) professionals

This document assumes that readers have some experience with BizTalk Server 2004, or are familiar with emerging application integration technologies and standards. Readers should be familiar with the concepts and topics presented in the BizTalk Server 2004 product documentation, which is updated quarterly.

This document is not intended for users who require assistance with using a particular feature or tool in BizTalk Server 2004; it does not contain procedures for configuring specific settings in BizTalk Server 2004, and it does not prescribe steps for deploying a particular BizTalk Server 2004 solution.

This document has five sections: Introduction, Performance Tuning for Low Latency Messaging, Test Results Analysis, Guidelines for Achieving Low Latency Messaging, and Summary.

The Introduction section provides an overview of this document and sets a context for the information that it contains.

The Performance Tuning for Low Latency Messaging section describes the various tests executed, the tuning done in each configuration, the objective, results, and the conclusion of each test case. It also shows the performance characterization for these configurations.

The Test Results Analysis section analyzes the results and sheds some light on the various components and factors involved, such as the computers running BizTalk Server, Microsoft SQL Server™ 2000, SAN storage, and network.

The Guidelines for Achieving Low Latency Messaging section provides guidelines concluded from this testing for achieving low latency messaging with BizTalk Server 2004.

The Summary section summarizes the testing results and shows some details of the best result.

This section describes the various tests executed and the tuning of each configuration to achieve the low latency performance requirements. It also shows the performance characterization for these configurations.

Software Architecture

The following figure shows the software architecture of the test environment:

Figure 2 Software architecture of the test environment


The components in this architecture are:

  • Mercury LoadRunner 8.0 as a load generator simulating one of the applications in this scenario.
  • BizTalk Server 2004 for messaging; configured as per the scenario requirements.
  • Test Harness Web Application that was built specifically for this testing using ASP.NET to simulate the other application in this scenario.
  • Another Test Harness Web Application, also built specifically for this testing using ASP.NET, to simulate the first application's asynchronous receive location for the response message in this scenario.

BizTalk Server Solution Description

The BizTalk Server solution in this scenario has the following characteristics that have some effect on performance:

Pipelines. All pipelines used are custom pipelines that have custom pipeline components.

Messages. Messages are SOAP messages that have routing information in the header and the payload in the body. The request message size is 5 KB and the response message size is 5 KB.

Testing Description

The Test Harness applications create and register the following performance counters:

  • Test Harness Web Application 1
    RequestTime
  • Test Harness Web Application 2
    ResponseTime
    RoundtripTime

In this architecture, the scenario is as follows:

  1. LoadRunner generates the HTTP request message, inserts a high-resolution time stamp "TimeStamp1" of the current time and posts it to the BizTalk Server 1-way HTTP Receive location.
  2. BizTalk Server receives the message and routes it to a 2-way HTTP send port.
  3. The 2-way HTTP send port posts the message to the Test Harness Web Application 1, which does the following:
    1. Records high-resolution time stamp "TimeStamp2" of the current time.
    2. Reads the time stamp from the request message "TimeStamp1."
    3. Calculates the request time as ("TimeStamp2" - "TimeStamp1") and updates the RequestTime performance counter with the calculated request time for this message.
    4. Generates a response message.
    5. Inserts the calculated RequestTime into the response message.
    6. Inserts a high-resolution time stamp "TimeStamp3" of the current time into the response message.
  4. The Test Harness Web Application 1 returns the response messages synchronously to the BizTalk Server 2-way HTTP send port.
  5. BizTalk Server routes the response message to a 1-way HTTP send port.
  6. The 1-way HTTP send port posts the message to the Test Harness Web Application 2, which does the following:
    1. Records high-resolution time at that point in time "TimeStamp4."
    2. Reads the time stamp from the response message "TimeStamp3."
    3. Calculates the response time as "TimeStamp4" - "TimeStamp3" and updates the ResponseTime performance counter with the calculated response time for this message.
    4. Reads the RequestTime from the response message.
    5. Calculates the Roundtrip time as RequestTime + ResponseTime and updates the RoundtripTime performance counter with the calculated RoundTrip for this message.

This method of time measurement excludes the time taken by the Test Harness Web Application 1 for processing the request and generating the response. To avoid time synchronization issues and get accurate time stamps, the Test Harness Web applications and LoadRunner were physically on the same computer.
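The four-timestamp arithmetic above can be sketched in Python as follows. This is an illustrative sketch only, not the actual Test Harness code; the function and dictionary field names are assumptions.

```python
import time

def stamp() -> float:
    """High-resolution time stamp in milliseconds."""
    return time.perf_counter() * 1000.0

# Test Harness Web Application 1: receives the request, computes RequestTime.
def handle_request(request):
    t2 = stamp()                               # "TimeStamp2"
    request_time = t2 - request["TimeStamp1"]  # request leg only
    return {
        "RequestTime": request_time,
        "TimeStamp3": stamp(),                 # stamped as the response is generated
    }

# Test Harness Web Application 2: receives the routed response, computes totals.
def handle_response(response):
    t4 = stamp()                               # "TimeStamp4"
    response_time = t4 - response["TimeStamp3"]
    roundtrip = response["RequestTime"] + response_time
    return response_time, roundtrip

# Simulated flow: LoadRunner stamps "TimeStamp1" and posts to BizTalk Server,
# which routes the message through the two send ports.
req = {"TimeStamp1": stamp()}
resp = handle_request(req)
response_time, roundtrip = handle_response(resp)
```

As in the testing description, the gap between "TimeStamp2" and "TimeStamp3" (the harness's own processing time) never enters the roundtrip figure.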

The latency was measured as the roundtrip time for each message. The scenario required the average message latency to be approximately 300 ms, with more than 90 percent of messages below 500 ms latency and less than 5 percent between 500 ms and 1 sec latency, so both the average (mean) and the median measurements were recorded. The mean can be highly affected by spikes in performance, especially if the MessageBox CPU utilization is higher than 40 percent (which means high throughput), so the difference between the mean and the median gives an indication of the variations or spikes.

The mean and median measurements included the ramp-up and ramp-down time. These numbers would have been better if those times were excluded.

Each test was run for 30 minutes, with 5 minutes for the ramp-up and ramp-down of virtual users at the beginning and end of each test.
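As a sketch of how the scenario's latency criteria could be checked against a set of measured roundtrip times; the sample data below is hypothetical, not from the tests.

```python
from statistics import mean, median

def meets_latency_targets(roundtrips_ms):
    """Check the scenario's criteria: more than 90% of messages under
    500 ms and fewer than 5% between 500 ms and 1 second."""
    n = len(roundtrips_ms)
    under_500 = sum(1 for t in roundtrips_ms if t < 500) / n
    mid_band = sum(1 for t in roundtrips_ms if 500 <= t < 1000) / n
    return under_500 > 0.90 and mid_band < 0.05

samples = [280, 310, 295, 450, 520, 300, 290, 305, 315, 299]  # hypothetical data
# A mean well above the median indicates latency spikes.
print(mean(samples), median(samples))
print(meets_latency_targets(samples))
```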

Tracking

The tracking requirements in this scenario are:

  • Standard DTA/HAT functionality to track Messaging events for health monitoring and for operations
  • Performance Auditing of messaging latency
  • Message Body Auditing for data accuracy

Message Body tracking is the most expensive among the tracking requirements. The other important factor is how many message bodies need tracking and at what points.

Message body tracking at the beginning of the receive pipeline and at the end of the send pipeline adds extra performance overhead, because the messages at those points are the wire messages, which BizTalk Server does not persist in normal processing unless message body tracking is enabled at those points. Tracking the message bodies at the end of the receive pipeline and at the beginning of the send pipeline does not require extra message persistence, because the messages are inserted into the MessageBox anyway.

Initially in the tests, message bodies were tracked at four points in the flow:

  1. At the beginning of the request receive pipeline at 1 in the previous diagram.
  2. At the end of the request send pipeline at 3 in the diagram above.
  3. At the beginning of the response receive pipeline at 4 in the diagram above.
  4. At the end of the response send pipeline at 6 in the diagram above.

Later in the testing, starting from test case 15, the team decided that the tracking requirements could be met by tracking message bodies only at the outbound point in the flow. Therefore, from that test case onward, message body tracking was enabled only at the end of the request send pipeline, at 3 in the diagram above.

Databases and SAN Storage

The database files were stored on a Storage Area Network (SAN). The SAN used in the testing was from 3PAR. The 3PAR SAN was configured with 134 x 10K RPM disks. The 3PAR architecture does virtual striping of 256 MB "chunklets" across as many disks as would meet the total allocation (for example, if you allocated a 1 GB virtual drive, you would get "chunklets" spread across 4 disks).

The cabinet was configured with 2 controllers, each having 8 GB RAM. Each volume was configured as RAID 1, which means that the disks are mirrored; this allowed the 3PAR to do the striping. The servers were connected by a Brocade 3800 2-Gb Fibre Channel switch and used Emulex HBA cards to connect to the SAN. For more information, see the 3PAR Web site.
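The chunklet arithmetic works out as follows. This is a sketch of the allocation math only, not 3PAR's actual placement algorithm.

```python
import math

CHUNKLET_MB = 256  # 3PAR stripes virtual volumes in 256 MB "chunklets"

def chunklets_for(volume_gb):
    """Number of chunklets (and hence, at most, distinct disks) backing
    a virtual volume of the given size."""
    return math.ceil(volume_gb * 1024 / CHUNKLET_MB)

print(chunklets_for(1))   # a 1 GB virtual drive spreads across 4 chunklets
print(chunklets_for(60))  # the 60 GB data-file volume spans 240 chunklets
```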

A number of SAN volumes were provided for each SQL database server as follows:

  • For the MessageBox database server:
    • 60 GB drive used for the BizTalkMessageBoxDb data file with 50 GB allocated
    • 20 GB drive used for the BizTalkMessageBoxDb log file with 1 GB allocated
  • For the Tracking database server:
    • 60 GB drive used for the BizTalkDTADb data file with 50 GB allocated
    • 20 GB drive used for the BizTalkDTADb log file with 1 GB allocated

Additional volumes were also provided for multiple MessageBox configurations.

Configuration Parameters Used for Low Latency Performance Tuning

The following table explains the configuration parameters used in this testing for low latency performance tuning:

Table 1 Configuration parameters used for low latency performance tuning

Parameter Location Description

BatchSize (for Messaging InProcess, Messaging Isolated, and XLANG/s)

BizTalkMgmtDb Adm_ServiceClass Table

This setting specifies the size of the batch of messages that is read from the MessageBox for sending by the sending host instance.

MaxReceiveInterval for Messaging InProcess

BizTalkMgmtDb Adm_ServiceClass Table

This controls BizTalk Server polling behavior: the maximum interval, in milliseconds, at which a BizTalk host instance polls for newly arrived messages.

Every message is written to the MsgBox database and needs to be routed out from it; this routing is called de-queuing because, conceptually, all messages go into a queue.

Periodically the BizTalk Server processing/sending services "poll" the MsgBox for new messages; if they find any, they process them and immediately return to look for more. If they find none to process, they wait up to this interval before looking again.

This is the maximum time the service will wait. If work is intermittent, the service gradually increases its wait interval up to this value.

The lower you set this value, the more work the computer will need to do; hence, CPU utilization will be higher.

MaxReceiveInterval for XLANG/s

BizTalkMgmtDb Adm_ServiceClass Table

This is the same as the In-Process MaxReceiveInterval, except that it applies to orchestrations.

MaxReceiveInterval for Messaging Isolated

BizTalkMgmtDb Adm_ServiceClass Table

This is the same as the In-Process MaxReceiveInterval, except that it applies to the Isolated Host (in our case IIS), and then only when dealing with a 2-way receive port, where the service instance for the return message is the Isolated Host.

MessagingLMBufferCacheSize

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\BTSSvc3.0]

This value determines the number of concurrent message batches that the BizTalk host instance can handle.

This value is typically increased when you see underutilization of the CPU for a receive host and you want to push messages faster into the MessageBox. However, if you make this larger, in some scenarios you may push the MessageBox too hard and then be throttled by the messaging agent (throttling and un-throttling on the receive side is posted as an event in the event log); this was not occurring in our tests.

HTTPBatchSize

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\BTSSvc3.0\HTTPReceive]

This setting specifies the batch size that the HTTP receive adapter uses to submit requests to BizTalk Server. When a message is received, the HTTP receive adapter waits either until this many messages have accumulated, so that it can submit them all at once, or until a one-second time-out has elapsed.

Setting the HTTPBatchSize value to 1 forces the HTTP receive adapter to submit each message as soon as it is received.

RequestQueueSize

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\BTSSvc3.0\HTTPReceive]

This represents the total number of requests that the HTTP adapter will process at any one time.

MessagingThreadPoolSize

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\BTSSvc{guid}]

This defines the number of threads, per processor, that will be used to process messages on the computer running BizTalk Server.

HTTPOutMaxConnection

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\BTSSvc{guid}]

This is the maximum number of connections that each BizTalk Server HTTP Send adapter instance will use to send messages.

HTTPOutInflightSize

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\BTSSvc{guid}]

This is the maximum number of concurrent HTTP requests that each BizTalk Server HTTP Send adapter instance will handle.

The recommended value for latency is 3 to 5 times the HTTPOutMaxConnection setting.

MaxWorkerThreads

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\BTSSvc{guid}]

This is the number of threads, per processor, that the HTTP adapter will use to process messages (a standard ASP.NET property).

MaxIOThreads

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\BTSSvc{guid}]

This is the number of threads, per processor, that will run to handle I/O within the common language runtime (CLR).

HTTPOutCompleteSize

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\BTSSvc{guid}]

This setting controls the number of messages being returned to BizTalk Server from the HTTP adapter, regardless of whether the port is one-way or two-way. Each message returned in the batch carries a request such as DeleteMessage, MoveToNextTransport, MoveToSuspendedQ, or, most interestingly, SubmitResponseMessage.

Although lowering this setting will improve response times in low-latency scenarios, it does increase the amount of communication between the adapter and the MessageBox. It is the send-side equivalent of the HTTPBatchSize setting on the HTTP receive adapter.

In low-latency scenarios, this is typically set to 1 to ensure that the messages are getting processed as quickly as they are received.
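The adaptive polling described for MaxReceiveInterval can be sketched as a poll loop with geometric back-off. This is illustrative only; BizTalk Server's actual de-queuing logic is internal, and the helper names are assumptions.

```python
import time

def poll_messages(fetch_batch, max_receive_interval_ms=50, run_for_s=0.25):
    """Poll for new messages, backing off geometrically up to
    MaxReceiveInterval when idle and polling again immediately
    when work is found."""
    wait_ms = 1.0
    processed = 0
    deadline = time.monotonic() + run_for_s
    while time.monotonic() < deadline:
        batch = fetch_batch()
        if batch:
            processed += len(batch)
            wait_ms = 1.0                 # work found: look again right away
        else:
            time.sleep(wait_ms / 1000.0)  # idle: wait before polling again
            wait_ms = min(wait_ms * 2, max_receive_interval_ms)
    return processed

# A stand-in for the MessageBox de-queue: three messages arrive in two batches.
pending = [["msg1", "msg2"], [], [], ["msg3"]]
drained = poll_messages(lambda: pending.pop(0) if pending else [])
print(drained)  # 3
```

A lower MaxReceiveInterval shortens the idle wait and therefore the latency, but, as noted above, raises CPU utilization because the MessageBox is polled more often.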
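The HTTPBatchSize behavior described above (submit when the batch fills or when the one-second time-out elapses, with a size of 1 meaning immediate submission) can be sketched as follows; the class and callback names are illustrative assumptions.

```python
import time

class BatchSubmitter:
    """Accumulate received messages and submit them when the batch is
    full or when a time-out elapses, as described for HTTPBatchSize."""

    def __init__(self, submit, batch_size=1, timeout_s=1.0):
        self.submit = submit          # callback into the messaging engine
        self.batch_size = batch_size
        self.timeout_s = timeout_s
        self.pending = []
        self.first_at = None

    def receive(self, msg):
        if self.first_at is None:
            self.first_at = time.monotonic()
        self.pending.append(msg)
        # With batch_size=1 every message is submitted as soon as it arrives.
        if len(self.pending) >= self.batch_size:
            self.flush()

    def tick(self):
        """Called periodically: flushes a partial batch once the time-out elapses."""
        if self.pending and time.monotonic() - self.first_at >= self.timeout_s:
            self.flush()

    def flush(self):
        self.submit(self.pending)
        self.pending, self.first_at = [], None

submitted = []
adapter = BatchSubmitter(submitted.append, batch_size=1)
adapter.receive("request-1")  # submitted immediately: lowest latency
adapter.receive("request-2")
print(submitted)  # [['request-1'], ['request-2']]
```

With a larger batch size, messages queue up until the batch fills or the time-out fires, which raises throughput per round trip to the MessageBox at the cost of latency.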

Hardware Architecture

The hardware architecture varied between the tests to find the right hardware architecture to achieve the performance requirements. The following section details the test cases and describes the hardware architecture used in each test case.

All the computers were from Hewlett Packard.

Test Case 1

Objective: The objective of this test case was to establish a baseline for the performance tuning starting from evenly distributed receiving and sending host instances onto multiple computers with the tracking host instance on a separate dedicated server computer. Three BizTalk performance parameter configurations were tested in this case based on suggested values from the product group for low latency. The effect of message body tracking was also tested.

Hardware Configuration

In this test, six computers running BizTalk Server were used: five receiving/sending servers and one server running the tracking host.

The following figure shows the hardware configuration for test case 1:

Figure 3 Hardware configuration - test case 1

Software Configuration

Using this hardware configuration, the following software configurations were tested. In the following table, A, B, and C are different configurations.

Table 2 Software configurations for test case 1

Parameter | A | B | C
Number of Virtual Users | 20 | 20 | 20
HTTPBatchSize | 1 | 1 | 1
In-Process MaximumReceiveInterval (ms) | 50 | 50 | 50
RequestMessageSize (KB) | 5 | 5 | 5
ResponseMessageSize (KB) | 9 | 9 | 9
Msg Body Tracking | Y | N | N
XLANG MaximumReceiveInterval (ms) | 50 | 50 | 50
Isolated Host MaximumReceiveInterval (ms) | 50 | 50 | 50
BTSSvc3.0 MessagingLMBufferCacheSize | 50 | 50 | 50
BTSSvc3.0 RequestQueueSize | 2048 | 128 | 256
BTSSvc{guid} MessagingThreadPoolSize | 20 | 20 | 20
BTSSvc{guid} HTTPOutMaxConnection | 20 | 20 | 48
BTSSvc{guid} HTTPOutInflightSize | 300 | 75 | 300
MaxWorkerThreads | Default | 100 | 100
MaxIOThreads | Default | 100 | 100
HTTPOutCompleteSize | Default | Default | Default

Test Results

The results of test case 1 are shown in the following tables.

Table 3 Throughput and Latency

Config | Throughput* (Mean / Median) | Request Time (Mean / Median) | ResponseTime (Mean / Median) | RoundtripTime (Mean / Median) | # of roundtrips
A | 79 / n/a | 472 / n/a | 1048 / n/a | 1520 / n/a | n/a
B | 82 / n/a | 390 / n/a | 883 / n/a | 1273 / n/a | n/a
C | 80 / n/a | 571 / n/a | 1036 / n/a | 1607 / n/a | n/a

*Throughput is in messages per second; all times are in milliseconds.

Table 4 Percentage of CPU

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 22 | 22 | 15 | 39 | 36 | 8 | 47 | 21 | n/a
B | 21 | 21 | 14 | 28 | 41 | 8 | 45 | 21 | n/a
C | 21 | 19 | 16 | 27 | 40 | 8 | 44 | 21 | 0.04

Table 5 Average memory used (MB)

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a
B | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a
C | 563 | 590 | 563 | 439 | 571 | 460 | 2077 | 2016 | 351

Conclusion

The results show that configuration B had the best results, with message body tracking turned off, BTSSvc3.0 RequestQueueSize set to 128 (compared to 2048 and 256 in configurations A and C), and BTSSvc{guid} HTTPOutInflightSize set to 75 (compared to 300 in configurations A and C).

Test Case 2

Objective: The objective of this test case was to see the effect of splitting the sending and receiving functionality onto separate computers while keeping the tracking functionality on a separate computer.

Hardware Configuration

In this test case, the receiving and sending hosts were split such that the six computers running BizTalk Server were used as two Receiving, three Sending, and one Tracking. The following figure shows the hardware configuration for test case 2:

Figure 4 Hardware configuration - test case 2

Software Configuration

Using this hardware configuration, the following software configuration was tested.

Table 6 Software configuration for test case 2

Parameter | Configuration A
Number of Virtual Users | 10
HTTPBatchSize | 1
In-Process MaximumReceiveInterval (ms) | 50
RequestMessageSize (KB) | 5
ResponseMessageSize (KB) | 9
Msg Body Tracking | N
XLANG MaximumReceiveInterval (ms) | 50
Isolated Host MaximumReceiveInterval (ms) | 50
BTSSvc3.0 MessagingLMBufferCacheSize | 50
BTSSvc3.0 RequestQueueSize | 256
BTSSvc{guid} MessagingThreadPoolSize | 20
BTSSvc{guid} HTTPOutMaxConnection | 48
BTSSvc{guid} HTTPOutInflightSize | 300
MaxWorkerThreads | 100
MaxIOThreads | 100
HTTPOutCompleteSize | Default

Test Results

The results of test case 2 are shown in the following tables.

Table 7 Throughput and Latency

Config | Throughput* (Mean / Median) | Request Time (Mean / Median) | ResponseTime (Mean / Median) | RoundtripTime (Mean / Median) | # of roundtrips
A | 124 / 160 | 89303 / 342086 | 88359 / 535957 | 177662 / 878043 | n/a

*Throughput is in messages per second; all times are in milliseconds.

Table 8 Percentage of CPU

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 89 | 91 | 91 | 22 | 20 | 8 | 39 | 23 | 0.05

Table 9 Average memory used (MB)

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 630 | 678 | 638 | 374 | 333 | 433 | 2076 | 2016 | 351

Conclusion

The results of this test show that the throughput was much higher than in the previous test case (124 msg/sec compared to approximately 80 msg/sec), and that this caused the latency to increase dramatically. This was attributed to the ratio of receiving computers to sending computers. Therefore, in the next test case, the number of computers was changed to have more sending computers than receiving computers.

Test Case 3

Objective: The objective of this test case was to change the number of computers to have more sending computers than receiving computers.

Hardware Configuration

In this case, the six computers running BizTalk Server were used as one Receiving, four Sending, and one Tracking. The following figure shows the hardware configuration for test case 3:

Figure 5 Hardware Configuration - test case 3

Software Configuration

Using this hardware configuration, the following software configuration was tested.

Table 10 Software configuration for test case 3

Parameter | Configuration A
Number of Virtual Users | 10
HTTPBatchSize | 1
In-Process MaximumReceiveInterval (ms) | 50
RequestMessageSize (KB) | 5
ResponseMessageSize (KB) | 9
Msg Body Tracking | N
XLANG MaximumReceiveInterval (ms) | 50
Isolated Host MaximumReceiveInterval (ms) | 50
BTSSvc3.0 MessagingLMBufferCacheSize | 50
BTSSvc3.0 RequestQueueSize | 256
BTSSvc{guid} MessagingThreadPoolSize | 20
BTSSvc{guid} HTTPOutMaxConnection | 48
BTSSvc{guid} HTTPOutInflightSize | 300
MaxWorkerThreads | 100
MaxIOThreads | 100
HTTPOutCompleteSize | Default

Test Results

The results of test case 3 are shown in the following tables.

Table 11 Throughput and Latency

Config | Throughput* (Mean / Median) | Request Time (Mean / Median) | ResponseTime (Mean / Median) | RoundtripTime (Mean / Median) | # of roundtrips
A | 94 / 100 | 349 / 238 | 516 / 468 | 865 / 706 | n/a

*Throughput is in messages per second; all times are in milliseconds.

Table 12 Percentage of CPU

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 23 | 19 | 25 | 43 | 37 | 8 | 42 | 21 | 0.05

Table 13 Average memory used (MB)

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 405 | 431 | 407 | 487 | 341 | 431 | 2076 | 2016 | 350

Conclusion

As expected, the results show that in this configuration the throughput decreased from 124 msg/sec to 94 msg/sec, and the latency therefore improved dramatically.

This also shows that, given the same resources, latency rises with throughput; to achieve the best results, you need to find the right balance of sending and receiving distribution.

Test Case 4

Objective: The objective of this test case was to change the number of virtual users (which translates to throughput) to achieve the lowest latency possible.

Hardware Configuration

In this case, the six computers running BizTalk Server were used as two Receiving, three Sending, and one Sending/Tracking. The following figure shows the hardware configuration for test case 4:

Figure 6 Hardware configuration - test case 4

Software Configuration

Using this hardware configuration, the following software configurations were tested; only the number of virtual users varied between configurations A through D:

Table 14 Software configurations for test case 4

Parameter | A | B | C | D
Number of Virtual Users | 10 | 12 | 8 | 5
HTTPBatchSize | 1 | 1 | 1 | 1
In-Process MaximumReceiveInterval (ms) | 50 | 50 | 50 | 50
RequestMessageSize (KB) | 5 | 5 | 5 | 5
ResponseMessageSize (KB) | 9 | 9 | 9 | 9
Msg Body Tracking | N | N | N | N
XLANG MaximumReceiveInterval (ms) | 50 | 50 | 50 | 50
Isolated Host MaximumReceiveInterval (ms) | 50 | 50 | 50 | 50
BTSSvc3.0 MessagingLMBufferCacheSize | 50 | 50 | 50 | 50
BTSSvc3.0 RequestQueueSize | 256 | 256 | 256 | 256
BTSSvc{guid} MessagingThreadPoolSize | 20 | 20 | 20 | 20
BTSSvc{guid} HTTPOutMaxConnection | 48 | 48 | 48 | 48
BTSSvc{guid} HTTPOutInflightSize | 300 | 300 | 300 | 300
MaxWorkerThreads | 100 | 100 | 100 | 100
MaxIOThreads | 100 | 100 | 100 | 100
HTTPOutCompleteSize | Default | Default | Default | Default

Test Results

The results of test case 4 are shown in the following tables.

Table 15 Throughput and Latency

Config | Throughput* (Mean / Median) | Request Time (Mean / Median) | ResponseTime (Mean / Median) | RoundtripTime (Mean / Median) | # of roundtrips
A | 95 / 102 | 241 / 239 | 472 / 477 | 713 / 716 | n/a
B | 99 / 103 | 407 / 269 | 698 / 604 | 1105 / 873 | n/a
C | 93 / 102 | 216 / 208 | 539 / 423 | 755 / 631 | n/a
D | 83 / 89 | 152 / 155 | 285 / 282 | 437 / 437 | n/a

*Throughput is in messages per second; all times are in milliseconds.

Table 16 Percentage of CPU

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 33 | n/a | 26 | 16 | 16 | 28 | 43 | 18 | 0.04
B | 39 | n/a | 26 | 16 | 16 | 29 | 44 | 17 | 0.05
C | 33 | n/a | 21 | 16 | 16 | 28 | 41 | 18 | 0.04
D | 27 | n/a | 17 | 11 | 16 | 24 | 35 | 19 | 0.04

Table 17 Average memory used (MB)

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 591 | n/a | 427 | 405 | 339 | 415 | 2080 | 2016 | 351
B | 628 | n/a | 440 | 393 | 339 | 428 | 2083 | 2016 | 351
C | 605 | n/a | 403 | 403 | 338 | 426 | 2084 | 2017 | 351
D | 598 | n/a | 396 | 413 | 337 | 411 | 2074 | 2009 | 351

Conclusion

The results show that when throughput decreases, latency decreases and vice versa. Configuration D had the best results with the lowest throughput of 83 msg/sec (5 virtual users) where the latency dropped to 437 ms.

Test Case 5

Objective: The objective of this test case was to test the effect of the message sizes. In the last configuration (configuration D), the HTTPOutCompleteSize parameter of the HTTP adapter was introduced and changed, as suggested by the product team.

Hardware Configuration

This case has the same hardware architecture as the previous test case; the six computers running BizTalk Server were used as two Receiving, three Sending, and one Sending and Tracking.

In addition, in this case the Extreme 7i load-balancing switch was replaced with a Foundry ServerIron XL load balancer that supports HTTP Keep-Alive. The following figure shows the hardware configuration for test case 5:

Figure 7 Hardware configuration - test case 5

Software Configuration

Using this hardware configuration, the following software configurations were tested; the response message size and the HTTPOutCompleteSize parameter varied between configurations.

Table 18 Software configurations for test case 5

Parameter | A | B | C* | D
Number of Virtual Users | 5 | 5 | 5 | 5
HTTPBatchSize | 1 | 1 | 1 | 1
In-Process MaximumReceiveInterval (ms) | 50 | 50 | 50 | 50
RequestMessageSize (KB) | 5 | 5 | 5 | 5
ResponseMessageSize (KB) | 9 | 5 | 5 | 5
Msg Body Tracking | N | N | N | N
XLANG MaximumReceiveInterval (ms) | 50 | 50 | 50 | 50
Isolated Host MaximumReceiveInterval (ms) | 50 | 50 | 50 | 50
BTSSvc3.0 MessagingLMBufferCacheSize | 50 | 50 | 50 | 50
BTSSvc3.0 RequestQueueSize | 256 | 256 | 256 | 256
BTSSvc{guid} MessagingThreadPoolSize | 20 | 20 | 20 | 20
BTSSvc{guid} HTTPOutMaxConnection | 48 | 48 | 48 | 48
BTSSvc{guid} HTTPOutInflightSize | 300 | 300 | 300 | 300
MaxWorkerThreads | 100 | 100 | 100 | 100
MaxIOThreads | 100 | 100 | 100 | 100
HTTPOutCompleteSize | Default | Default | Default | 1

*Configurations B and C are the same; the test was repeated to confirm that the results were repeatable.

Test Results

The results of test case 5 are shown in the following tables.

Table 19 Throughput and Latency

Config | Throughput* (Mean / Median) | Request Time (Mean / Median) | ResponseTime (Mean / Median) | RoundtripTime (Mean / Median) | # of roundtrips
A | 90 / 97 | 247 / 163 | 327 / 281 | 574 / 444 | n/a
B | 90 / 96 | 172 / 166 | 273 / 275 | 445 / 441 | n/a
C | 91 / 97 | 169 / 162 | 275 / 282 | 444 / 444 | n/a
D | 85 / 91 | 162 / 159 | 219 / 224 | 381 / 383 | n/a

*Throughput is in messages per second; all times are in milliseconds.

Table 20 Percentage of CPU

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 29 | 27 | 21 | 12 | 18 | 28 | 41 | 19 | 0.04
B | 28 | 20 | 17 | 18 | 12 | 23 | 42 | 19 | 0.05
C | 32 | 24 | 19 | 18 | 12 | 27 | 42 | 19 | 0.06
D | 30 | 24 | 20 | 17 | 10 | 27 | 43 | 19 | 0.05

Table 21 Average memory used (MB)

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 597 | 455 | 398 | 412 | 337 | 415 | 2074 | 2009 | 351
B | 589 | 429 | 395 | 414 | 337 | 417 | 2078 | 2013 | 354
C | 599 | 435 | 396 | 444 | 337 | 407 | 2109 | 2009 | 350
D | 601 | 433 | 398 | 440 | 337 | 411 | 2083 | 2009 | 350

Conclusion

The results show that when the load balancer was replaced with one that supports HTTP Keep-Alive, throughput increased from 83 msg/sec (test case 4, configuration D) to 90 msg/sec, and latency accordingly increased from 437 ms to 444 ms.

In configurations B and C (identical configurations; the test was repeated to confirm that the results were repeatable), the response message size was reduced from 9 KB to 5 KB, yet the results show that the difference is negligible.

In configuration D, the HTTPOutCompleteSize parameter was changed from the default value to 1; the results show that throughput decreased from 90 msg/sec to 85 msg/sec while latency decreased from 444 ms to 381 ms.

Note
Configuration D is the best so far (throughput 85 msg/sec, latency 381 ms) compared to the best result achieved in the previous test case 4 configuration D (throughput 83 msg/sec, latency 437 ms).

Test Case 6

Objective: The objective of this test case was to find the optimal performance with message body tracking turned on.

Hardware Configuration

This case has the same hardware architecture as in the previous test case; the 6 computers running BizTalk Server were used as two Receiving, three Sending, and one Sending and Tracking. The following figure shows the hardware configuration for test case 6:

Figure 8 Hardware configuration - test case 6

Software Configuration

Using this hardware configuration, the following software configurations were tested. The configuration parameters that varied in each configuration are in bold type.

Table 22 Software configurations for test case 6

Parameter | A | B | C*
Number of Virtual Users | 5 | 8 | 7
HTTPBatchSize | 1 | 1 | 1
In-Process MaximumReceiveInterval (ms) | 50 | 50 | 50
RequestMessageSize (KB) | 5 | 5 | 5
ResponseMessageSize (KB) | 5 | 5 | 5
Msg Body Tracking | Y | Y | Y
XLANG MaximumReceiveInterval (ms) | 50 | 50 | 50
Isolated Host MaximumReceiveInterval (ms) | 50 | 50 | 50
BTSSvc3.0 MessagingLMBufferCacheSize | 50 | 50 | 50
BTSSvc3.0 RequestQueueSize | 256 | 256 | 256
BTSSvc{guid} MessagingThreadPoolSize | 20 | 20 | 20
BTSSvc{guid} HTTPOutMaxConnection | 48 | 48 | 48
BTSSvc{guid} HTTPOutInflightSize | 300 | 300 | 300
MaxWorkerThreads | 100 | 100 | 100
MaxIOThreads | 100 | 100 | 100
HTTPOutCompleteSize | 1 | 1 | 1

Test Results

The results of test case 6 are shown in the following tables.

Table 23 Throughput and Latency

Config | Throughput* (Mean / Median) | Request Time (Mean / Median) | Response Time (Mean / Median) | Roundtrip Time (Mean / Median) | # of roundtrips
A | 68 / 72 | 161 / 158 | 206 / 202 | 367 / 360 | n/a
B | 74 / 79 | 218 / 219 | 306 / 309 | 524 / 528 | n/a
C | 72 / 77 | 200 / 200 | 273 / 266 | 473 / 466 | n/a

*Throughput is in messages per second; all times are in milliseconds.

Table 24 Percentage of CPU

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 28 | 22 | 17 | 10 | 15 | 23 | 48 | 19 | 0.06
B | 31 | 22 | 22 | 15 | 14 | 25 | 52 | 18 | 0.05
C | 30 | 21 | 22 | 12 | 16 | 25 | 51 | 18 | 0.04

Table 25 Average memory used (MB)

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 598 | 437 | 401 | 444 | 338 | 409 | 2100 | 2009 | 351
B | 618 | 442 | 397 | 479 | 338 | 406 | 2046 | 2009 | 350
C | 597 | 447 | 401 | 481 | 339 | 423 | 2048 | 2009 | 350

Conclusion

The results show that when message body tracking is turned on, the throughput decreased from 85 msg/sec in test case 5 – configuration D to 68 msg/sec, and therefore latency decreased from 381 ms to 367 ms.

In configurations B and C, the number of virtual users was changed from 5 to 8 and then to 7 to find the optimal performance with message body tracking turned on. The results showed again that more virtual users mean more throughput, and that latency therefore increases accordingly.
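
The virtual-user effect follows from the closed-loop shape of the load: each virtual user waits for its response before sending the next request. A rough sketch of the interactive response-time law (throughput X = N / (R + Z), for N users, response time R, and think time Z), using hypothetical numbers rather than the lab's figures:

```python
def closed_loop_throughput(users, latency_s, think_s=0.0):
    """Interactive response-time law: X = N / (R + Z).

    users     -- number of concurrent closed-loop users (N)
    latency_s -- per-request response time in seconds (R)
    think_s   -- pause between requests in seconds (Z)
    """
    return users / (latency_s + think_s)

# Hypothetical numbers: 5 users at a flat 400 ms latency, no think time.
x5 = closed_loop_throughput(5, 0.4)   # 12.5 requests/sec
# Adding users raises offered throughput only while latency stays flat;
# once the servers saturate, R rises and throughput levels off instead.
x8 = closed_loop_throughput(8, 0.4)   # 20.0 requests/sec
```

This is why the tables above show latency climbing as users were added: past the saturation point, the extra concurrency is absorbed as queueing time rather than throughput.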

Test Case 7

Objective: The objective of this test case was to see the effect of adding one more receiving instance and to compare these results to test case 6, configuration C.

Hardware Configuration

In this case, the six computers running BizTalk Server were used as two Receiving, two Sending, one Sending and Receiving, and one Sending and Tracking. The following figure shows the hardware configuration for test case 7:

Figure 9 Hardware Configuration - test case 7

Software Configuration

Using this hardware configuration, the following software configurations were tested. The configuration parameters that varied in each configuration are in bold type.

Table 26 Software configurations for test case 7

Parameter | A
Number of Virtual Users | 7
HTTPBatchSize | 1
In-Process MaximumReceiveInterval (ms) | 50
RequestMessageSize (KB) | 5
ResponseMessageSize (KB) | 5
Msg Body Tracking | Y
XLANG MaximumReceiveInterval (ms) | 50
Isolated Host MaximumReceiveInterval (ms) | 50
BTSSvc3.0 MessagingLMBufferCacheSize | 50
BTSSvc3.0 RequestQueueSize | 256
BTSSvc{guid} MessagingThreadPoolSize | 20
BTSSvc{guid} HTTPOutMaxConnection | 48
BTSSvc{guid} HTTPOutInflightSize | 300
MaxWorkerThreads | 100
MaxIOThreads | 100
HTTPOutCompleteSize | 1

Test Results

The results of test case 7 are shown in the following tables.

Table 27 Throughput and Latency

Config | Throughput* (Mean / Median) | Request Time (Mean / Median) | Response Time (Mean / Median) | Roundtrip Time (Mean / Median) | # of roundtrips
A | 64 / 68 | 206 / 199 | 293 / 289 | 499 / 488 | n/a

*Throughput is in messages per second; all times are in milliseconds.

Table 28 Percentage of CPU

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 27 | 26 | 18 | 7 | 6 | 22 | 50 | 18 | 0.05

Table 29 Average memory used (MB)

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 598 | 592 | 399 | 509 | 337 | 415 | 2062 | 2012 | 350

Conclusion

The results of this test case, compared to the previous test case, show that throughput decreased from 72 msg/sec to 64 msg/sec and latency increased from 473 ms to 499 ms. This showed that this ratio of receiving host instances to sending host instances produced worse results.

Test Case 8 Multiple Files in the MsgBox Database Data File Filegroup

Objective: The objective of this test case was to test the effect of multiple files in the BizTalkMessageBoxDb database data file filegroup. Seven more files were added to the existing BizTalkMessageBoxDb database data file filegroup (a total of eight files in the existing primary filegroup). Each file was created on a separate logical drive of the 60 GB physical drive dedicated to the BizTalkMessageBoxDb data file. That drive was repartitioned into eight logical drives (one primary partition, and one extended partition containing seven logical drives).
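
For background, SQL Server spreads allocations across the files of a filegroup (proportional fill), which is why multiple data files can distribute I/O across disks; here, however, all eight logical drives shared one physical drive, so the I/O was never truly parallelized. A toy sketch of the spreading idea, simplified to round-robin allocation (real proportional fill weights each file by its free space):

```python
from itertools import cycle

# Hypothetical file names; a simplified round-robin stand-in for
# SQL Server's proportional-fill allocation across a filegroup.
files = ["data1.mdf", "data2.ndf", "data3.ndf", "data4.ndf"]
allocator = cycle(files)

# Place 8 extent allocations; each file receives an equal share.
placement = [next(allocator) for _ in range(8)]
counts = {f: placement.count(f) for f in files}
```

Spreading allocations this way only reduces latency when the files sit on independent spindles; on one shared physical drive the same head services every file, which is consistent with the negative result below.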

Hardware Configuration

In this case, the six computers running BizTalk Server were used again as two Receiving, three Sending, and one Sending and Tracking. The following figure shows the hardware configuration for test case 8:

Figure 10 Hardware configuration - test case 8

Software Configuration

Using this hardware configuration, the following software configurations were tested. The configuration parameters that varied in each configuration are in bold type.

Table 30 Software configurations for test case 8

Parameter | A
Number of Virtual Users | 7
HTTPBatchSize | 1
In-Process MaximumReceiveInterval (ms) | 50
RequestMessageSize (KB) | 5
ResponseMessageSize (KB) | 5
Msg Body Tracking | Y
XLANG MaximumReceiveInterval (ms) | 50
Isolated Host MaximumReceiveInterval (ms) | 50
BTSSvc3.0 MessagingLMBufferCacheSize | 50
BTSSvc3.0 RequestQueueSize | 256
BTSSvc{guid} MessagingThreadPoolSize | 20
BTSSvc{guid} HTTPOutMaxConnection | 48
BTSSvc{guid} HTTPOutInflightSize | 300
MaxWorkerThreads | 100
MaxIOThreads | 100
HTTPOutCompleteSize | 1

Test Results

The results of test case 8 are shown in the following tables.

Table 31 Throughput and Latency

Config | Throughput* (Mean / Median) | Request Time (Mean / Median) | Response Time (Mean / Median) | Roundtrip Time (Mean / Median) | # of roundtrips
A | 71 / 76 | 252 / 243 | 398 / 392 | 642 / 644 | 140111

*Throughput is in messages per second; all times are in milliseconds.

Table 32 Percentage of CPU

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 30 | 19 | 21 | 38 | 16 | 25 | 47 | 17 | 0.07

Table 33 Average memory used (MB)

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 350 | 310 | 285 | 438 | 283 | 324 | 1987 | 2017 | 351

Conclusion

The results show that throughput was similar to the same configuration in test case 6 configuration C, although latency increased from 473 ms to 642 ms. This means that using multiple files in the MessageBox database data file filegroup had a negative impact on performance.

Test Case 9 First Opteron Test

Objective: To test the effect of replacing the Intel Xeon 8-way (HT) 3.0 GHz MessageBox database computer with an AMD Opteron 4-way 2.4 GHz 64-bit single-core computer.

Hardware Configuration

In this test case, the 6 computers running BizTalk Server were used again as 2 Receiving, 3 Sending, and 1 Sending & Tracking. The following figure shows the hardware configuration for test case 9:

Figure 11 Hardware configuration - test case 9

Software Configuration

Using this hardware configuration, the following software configurations were tested. The configuration parameters that varied in each configuration are in bold type.

Table 34 Software configurations for test case 9

Parameter | A
Number of Virtual Users | 7
HTTPBatchSize | 1
In-Process MaximumReceiveInterval (ms) | 50
RequestMessageSize (KB) | 5
ResponseMessageSize (KB) | 5
Msg Body Tracking | Y
XLANG MaximumReceiveInterval (ms) | 50
Isolated Host MaximumReceiveInterval (ms) | 50
BTSSvc3.0 MessagingLMBufferCacheSize | 50
BTSSvc3.0 RequestQueueSize | 256
BTSSvc{guid} MessagingThreadPoolSize | 20
BTSSvc{guid} HTTPOutMaxConnection | 48
BTSSvc{guid} HTTPOutInflightSize | 300
MaxWorkerThreads | 100
MaxIOThreads | 100
HTTPOutCompleteSize | 1

Test Results

The results of test case 9 are shown in the following tables.

Table 35 Throughput and Latency

Config | Throughput* (Mean / Median) | Request Time (Mean / Median) | Response Time (Mean / Median) | Roundtrip Time (Mean / Median) | # of roundtrips
A | 128 / 123 | 979 / 329 | 1045 / 580 | 1912 / 905 | 245852

*Throughput is in messages per second; all times are in milliseconds.

Table 36 Percentage of CPU

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 45 | 45 | 47 | 22 | 16 | 73 | 73 | 20 | 0.02

Table 37 Average memory used (MB)

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 570 | 515 | 542 | 446 | 325 | 513 | 3508 | 2049 | 260

Conclusion

The results show that throughput almost doubled, from 72 msg/sec in test case 6 configuration C to 128 msg/sec, although latency also increased dramatically, from 473 ms to 1912 ms.

This meant that the AMD Opteron 4-way 2.4 GHz 64-bit computer was more powerful than the Intel Xeon 8-way (HT) 3.0 GHz MessageBox computer, but that this ratio of receiving computers to sending computers resulted in more throughput and therefore higher latency.

The ratio of receiving computers to sending computers matters because receiving is more expensive than sending (receiving includes, for example, subscription matching and inserting message bodies and properties into the MessageBox). As a result, senders may "starve" when there is too much receiving, because receiving can take away or lock resources that senders need.

Therefore, having more senders is not sufficient; it is also important to have fewer receivers and/or less input from receivers so that the senders can get to the messages and do their work efficiently.

In the next test cases, different hardware configurations were tested to find the optimal balance with the Opteron 4-way 64-bit SQL Server MessageBox computer.

Test Case 10 Opteron

Objective: To add one more sending host instance and to reduce the number of virtual users from 7 to 5, decreasing throughput to achieve the required latency.

Hardware Configuration

In this test case, one more sending host instance was added to one of the receiving computers so the 6 computers running BizTalk Server were used as 1 Receiving, 1 Sending & Receiving, 3 Sending, and 1 Sending & Tracking. The following figure shows the hardware configuration for test case 10.

Figure 12 Hardware configuration - test case 10

Software Configuration

Using this hardware configuration, the following software configurations were tested. The configuration parameters that varied in each configuration are in bold type.

Table 38 Software configurations for test case 10

Parameter | A
Number of Virtual Users | 5
HTTPBatchSize | 1
In-Process MaximumReceiveInterval (ms) | 50
RequestMessageSize (KB) | 5
ResponseMessageSize (KB) | 5
Msg Body Tracking | Y
XLANG MaximumReceiveInterval (ms) | 50
Isolated Host MaximumReceiveInterval (ms) | 50
BTSSvc3.0 MessagingLMBufferCacheSize | 50
BTSSvc3.0 RequestQueueSize | 256
BTSSvc{guid} MessagingThreadPoolSize | 20
BTSSvc{guid} HTTPOutMaxConnection | 48
BTSSvc{guid} HTTPOutInflightSize | 300
MaxWorkerThreads | 100
MaxIOThreads | 100
HTTPOutCompleteSize | 1

Test Results

The results of test case 10 are shown in the following tables.

Table 39 Throughput and Latency

Config | Throughput* (Mean / Median) | Request Time (Mean / Median) | Response Time (Mean / Median) | Roundtrip Time (Mean / Median) | # of roundtrips
A | 117 / 122 | 221 / 190 | 756 / 596 | 985 / 796 | 223724

*Throughput is in messages per second; all times are in milliseconds.

Table 40 Percentage of CPU

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 34 | 36 | 26 | 25 | 64 | 48 | 65 | 16 | 0.05

Table 41 Average memory used (MB)

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 493 | 452 | 401 | 442 | 521 | 422 | 4070 | 2055 | 260

Conclusion

The results show that throughput decreased from 128 msg/sec in the previous test case 9 to 117 msg/sec, and that latency also decreased from 1912 ms to 985 ms; however, the latency needed to be decreased further.

Test Case 11 Opteron

Objective: To further reduce throughput to get lower latency.

Hardware Configuration

In this test case, one more sending host instance was added to the other receiving computer so the 6 computers running BizTalk Server were used as 2 Sending & Receiving, 3 Sending, and 1 Sending & Tracking. The following figure shows the hardware configuration for test case 11:

Figure 13 Hardware configuration - test case 11

Software Configuration

Using this hardware configuration, the following software configurations were tested. The configuration parameters that varied in each configuration are in bold type.

Table 42 Software configurations for test case 11

Parameter | A
Number of Virtual Users | 5
HTTPBatchSize | 1
In-Process MaximumReceiveInterval (ms) | 50
RequestMessageSize (KB) | 5
ResponseMessageSize (KB) | 5
Msg Body Tracking | Y
XLANG MaximumReceiveInterval (ms) | 50
Isolated Host MaximumReceiveInterval (ms) | 50
BTSSvc3.0 MessagingLMBufferCacheSize | 50
BTSSvc3.0 RequestQueueSize | 256
BTSSvc{guid} MessagingThreadPoolSize | 20
BTSSvc{guid} HTTPOutMaxConnection | 48
BTSSvc{guid} HTTPOutInflightSize | 300
MaxWorkerThreads | 100
MaxIOThreads | 100
HTTPOutCompleteSize | 1

Test Results

The results of test case 11 are shown in the following tables.

Table 43 Throughput and Latency

Config | Throughput* (Mean / Median) | Request Time (Mean / Median) | Response Time (Mean / Median) | Roundtrip Time (Mean / Median) | # of roundtrips
A | 100 / 106 | 146 / 142 | 264 / 258 | 405 / 396 | 191809

*Throughput is in messages per second; all times are in milliseconds.

Table 44 Percentage of CPU

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 24 | 22 | 18 | 62 | 51 | 23 | 67 | 15 | 0.07

Table 45 Average memory used (MB)

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 470 | 385 | 374 | 593 | 457 | 401 | 4059 | 2055 | 262

Conclusion

The results show that throughput decreased to 100 msg/sec and latency also decreased to 405 ms, although the latency still needed to be decreased further.

Test Case 12 Opteron

Objective: To achieve better results by adding one more sending computer.

Hardware Configuration

In this test case, an additional computer was added to the BizTalk group as a sending computer so that 7 computers running BizTalk Server were used as 2 Sending & Receiving, 4 Sending, and 1 Sending & Tracking. The following figure shows the hardware configuration for test case 12:

Figure 14 Hardware configuration - test case 12

Software Configuration

Using this hardware configuration, the following software configurations were tested. The configuration parameters that varied in each configuration are in bold type.

Table 46 Software configurations for test case 12

Parameter | A
Number of Virtual Users | 5
HTTPBatchSize | 1
In-Process MaximumReceiveInterval (ms) | 50
RequestMessageSize (KB) | 5
ResponseMessageSize (KB) | 5
Msg Body Tracking | Y
XLANG MaximumReceiveInterval (ms) | 50
Isolated Host MaximumReceiveInterval (ms) | 50
BTSSvc3.0 MessagingLMBufferCacheSize | 50
BTSSvc3.0 RequestQueueSize | 256
BTSSvc{guid} MessagingThreadPoolSize | 20
BTSSvc{guid} HTTPOutMaxConnection | 48
BTSSvc{guid} HTTPOutInflightSize | 300
MaxWorkerThreads | 100
MaxIOThreads | 100
HTTPOutCompleteSize | 1

Test Results

The results of test case 12 are shown in the following tables.

Table 47 Throughput and Latency

Config | Throughput* (Mean / Median) | Request Time (Mean / Median) | Response Time (Mean / Median) | Roundtrip Time (Mean / Median) | # of roundtrips
A | 103 / 108 | 152 / 139 | 266 / 260 | 404 / 399 | 197989

*Throughput is in messages per second; all times are in milliseconds.

Table 48 Percentage of CPU

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 24 | 18 | 14 | 57 | 45 | 26 | 66 | 15 | 0.04

Table 49 Average memory used (MB)

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 477 | 415 | 380 | 587 | 463 | 407 | 4097 | 2055 | 261

Conclusion

The results show that throughput increased slightly from 100 msg/sec in the previous test case to 103 msg/sec, and that latency stayed almost the same. This means that adding another sending computer allowed more messages to be processed, but the overall system was still too heavily taxed for latency to decrease.

Test Case 13 Opteron

Objective: To check the effect of the balance between the number of sending and receiving host instances. To check this balance, one sending host instance was stopped in one of the sending and receiving computers.

Hardware Configuration

In this test case, the 7 computers running BizTalk Server were used as 1 Sending & Receiving, 1 Receiving, 4 Sending, and 1 Sending & Tracking. The following figure shows the hardware configuration for test case 13:

Figure 15 Hardware configuration - test case 13

Software Configuration

Using this hardware configuration, the following software configurations were tested. The configuration parameters that varied in each configuration are in bold type.

Table 50 Software configurations for test case 13

Parameter | A
Number of Virtual Users | 5
HTTPBatchSize | 1
In-Process MaximumReceiveInterval (ms) | 50
RequestMessageSize (KB) | 5
ResponseMessageSize (KB) | 5
Msg Body Tracking | Y
XLANG MaximumReceiveInterval (ms) | 50
Isolated Host MaximumReceiveInterval (ms) | 50
BTSSvc3.0 MessagingLMBufferCacheSize | 50
BTSSvc3.0 RequestQueueSize | 256
BTSSvc{guid} MessagingThreadPoolSize | 20
BTSSvc{guid} HTTPOutMaxConnection | 48
BTSSvc{guid} HTTPOutInflightSize | 300
MaxWorkerThreads | 100
MaxIOThreads | 100
HTTPOutCompleteSize | 1

Test Results

The results of test case 13 are shown in the following tables.

Table 51 Throughput and Latency

Config | Throughput* (Mean / Median) | Request Time (Mean / Median) | Response Time (Mean / Median) | Roundtrip Time (Mean / Median) | # of roundtrips
A | 111 / 118 | 158 / 150 | 307 / 290 | 459 / 433 | 213660

*Throughput is in messages per second; all times are in milliseconds.

Table 52 Percentage of CPU

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 26 | 23 | 20 | 58 | 23 | 24 | 72 | 16 | 0.04

Table 53 Average memory used (MB)

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 489 | 430 | 384 | 617 | 322 | 409 | 4101 | 2056 | 261

Conclusion

The results show that by stopping one of the sending host instances, throughput increased from 103 msg/sec in the previous test case to 111 msg/sec, and latency therefore also increased from 404 ms to 459 ms. This implies that the receiving computers were capable of receiving more (that is, processing messages and inserting them into the MessageBox), thereby creating more work for the rest of the system.

Test Case 14

Objective: To reduce throughput by reducing the number of virtual users to achieve better latency, and to test the effect of enabling and disabling message body tracking with the AMD Opteron 4-way 2.4 GHz 64-bit single-core computer.

Hardware Configuration

In this test, the 7 computers running BizTalk Server were again used as 2 Sending & Receiving, 4 Sending, and 1 Sending & Tracking. Reducing the number of virtual users from 5 to 3 and 2 was tested. The following figure shows the hardware configuration for test case 14:

Figure 16 Hardware configuration - test case 14

Software Configuration

Using this hardware configuration, the following software configurations were tested. The configuration parameters that varied in each configuration are in bold type.

Table 54 Software configurations for test case 14

Parameter | A | B | C
Number of Virtual Users | 3 | 3 | 2
HTTPBatchSize | 1 | 1 | 1
In-Process MaximumReceiveInterval (ms) | 50 | 50 | 50
RequestMessageSize (KB) | 5 | 5 | 5
ResponseMessageSize (KB) | 5 | 5 | 5
Msg Body Tracking | Y | N | N
XLANG MaximumReceiveInterval (ms) | 50 | 50 | 50
Isolated Host MaximumReceiveInterval (ms) | 50 | 50 | 50
BTSSvc3.0 MessagingLMBufferCacheSize | 50 | 50 | 50
BTSSvc3.0 RequestQueueSize | 256 | 256 | 256
BTSSvc{guid} MessagingThreadPoolSize | 20 | 20 | 20
BTSSvc{guid} HTTPOutMaxConnection | 48 | 48 | 48
BTSSvc{guid} HTTPOutInflightSize | 300 | 300 | 300
MaxWorkerThreads | 100 | 100 | 100
MaxIOThreads | 100 | 100 | 100
HTTPOutCompleteSize | 1 | 1 | 1

Test Results

The results of test case 14 are shown in the following tables.

Table 55 Throughput and Latency

Config | Throughput* (Mean / Median) | Request Time (Mean / Median) | Response Time (Mean / Median) | Roundtrip Time (Mean / Median) | # of roundtrips
A | 79 / 83 | 104 / 100 | 211 / 210 | 317 / 312 | 152218
B | 92 / 98 | 92 / 89 | 209 / 210 | 301 / 303 | 177730
C | 75 / 80 | 75 / 73 | 176 / 169 | 251 / 115 | 144268

*Throughput is in messages per second; all times are in milliseconds.

Table 56 Percentage of CPU

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 15 | 13 | 10 | 41 | 32 | 15 | 56 | 12 | 0.03
B | 18 | 13 | 12 | n/a | 44 | 20 | 46 | 14 | 0.04
C | 12 | 9 | 10 | n/a | 31 | 13 | 33 | 11 | 0.04

Table 57 Average memory used (MB)

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 491 | 372 | 373 | 511 | 426 | 394 | 4097 | 2059 | 261
B | 480 | 376 | 375 | n/a | 446 | 396 | 4157 | 2059 | 261
C | 362 | 285 | 370 | n/a | 361 | 389 | 4153 | 2059 | 261

Conclusion

The results of configuration A show that reducing the number of virtual users from 5 (test case 12 configuration A) to 3 reduced throughput from 103 msg/sec to 79 msg/sec, and latency therefore decreased from 404 ms to 317 ms.

The results of configuration B show that disabling message body tracking increased throughput from 79 msg/sec to 92 msg/sec, and latency also decreased from 317 ms to 301 ms.

The results of configuration C show that, with message body tracking still disabled, reducing the number of virtual users further from 3 to 2 reduced throughput from 92 msg/sec to 75 msg/sec, and latency therefore decreased from 301 ms to 251 ms.

Test Case 15 Opteron Multiple Log Files for the MsgBox Database

Objective: To test the effect of splitting the MessageBox database log file into multiple files on different drives, in an attempt to reduce the impact of SQL Server checkpointing, which results in very high SAN I/O. Also, starting from this test, the team decided that the tracking requirements could be met by tracking message bodies only at the end of the outbound send pipeline.

Hardware Configuration

This test case is the same as the previous test with 7 computers running BizTalk Server used as 2 Sending & Receiving, 4 Sending, and 1 Sending & Tracking. The following figure shows the hardware configuration for test case 15:

Figure 17 Hardware configuration - test case 15

Software Configuration

Using this hardware configuration, the following software configurations were tested. The configuration parameters that varied in each configuration are in bold type.

Table 58 Software configurations for test case 15

Parameter | A | B
Number of Virtual Users | 3 | 2
HTTPBatchSize | 1 | 1
In-Process MaximumReceiveInterval (ms) | 50 | 50
RequestMessageSize (KB) | 5 | 5
ResponseMessageSize (KB) | 5 | 5
Msg Body Tracking | Y* | Y*
XLANG MaximumReceiveInterval (ms) | 50 | 50
Isolated Host MaximumReceiveInterval (ms) | 50 | 50
BTSSvc3.0 MessagingLMBufferCacheSize | 50 | 50
BTSSvc3.0 RequestQueueSize | 256 | 256
BTSSvc{guid} MessagingThreadPoolSize | 20 | 20
BTSSvc{guid} HTTPOutMaxConnection | 48 | 48
BTSSvc{guid} HTTPOutInflightSize | 300 | 300
MaxWorkerThreads | 100 | 100
MaxIOThreads | 100 | 100
HTTPOutCompleteSize | 1 | 1

*Msg Body tracking was changed from this test forward to track the message body at the end of the outbound send pipeline only.

Test Results

The results of test case 15 are shown in the following tables.

Table 59 Throughput and Latency

Config | Throughput* (Mean / Median) | Request Time (Mean / Median) | Response Time (Mean / Median) | Roundtrip Time (Mean / Median) | # of roundtrips
A | 87 / 92 | 99 / 100 | 226 / 230 | 333 / 227 | 166629
B | 75 / 79 | 81 / 75 | 184 / 189 | 267 / 265 | 144387

*Throughput is in messages per second; all times are in milliseconds.

Table 60 Percentage of CPU

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 18 | 10 | 14 | n/a | 29 | 20 | 52 | 13 | 0.06
B | 16 | 9 | 11 | n/a | 29 | 16 | 35 | 11 | 0.03

Table 61 Average memory used (MB)

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 475 | 383 | 373 | n/a | 435 | 396 | 4072 | 2059 | 263
B | 474 | 379 | 372 | n/a | 439 | 399 | 1916 | 2075 | 263

Conclusion

The results of this test case configuration A show that throughput was higher than in test case 14 configuration A and latency was also higher. This is because message body tracking was less expensive in this case, which caused throughput to increase and therefore latency to increase as well.

Compared to test case 14 configuration B, the results show that with no tracking the throughput was higher and latency was lower, confirming that message body tracking has a negative impact on performance.

The results of this test case configuration B show that throughput was similar to test case 14 configuration C with 2 virtual users, but latency was slightly higher.

Splitting the log file into 3 files on different drives did not make a difference, because SQL Server was using one log file at a time, moving to the next after each checkpoint; the log I/O was therefore not distributed across the 3 files simultaneously and so was not spread across multiple drives.

The effect of changing message body tracking to track the message body at one point only (at the end of the outbound send pipeline) instead of the original setting (tracking message bodies at 4 points, as explained in the test description section) was not clear from this test case, although it was decided to continue testing with this new message body tracking setting.

Test Case 16 Dual-Core Opteron

Objective: To test the AMD dual-core Opteron 4-way 64-bit CPU instead of the AMD single-core Opteron 4-way 64-bit CPU. Another objective was to test a greater number of virtual users, but with delays between requests, to verify that having more virtual users sending at the same overall rate does not have adverse effects on performance.

Hardware Configuration

This test case is the same as the previous test with 7 computers running BizTalk Server used as 2 Sending & Receiving, 4 Sending, and 1 Sending & Tracking. The following figure shows the hardware configuration for test case 16:

Figure 18 Hardware configuration - test case 16

Software Configuration

Using this hardware configuration, the following software configurations were tested. The configuration parameters that varied in each configuration are in bold type.

Table 62 Software configurations for test case 16

Parameter | A | B | C
Number of Virtual Users | 2 | 30* | 40**
HTTPBatchSize | 1 | 1 | 1
In-Process MaximumReceiveInterval (ms) | 50 | 50 | 50
RequestMessageSize (KB) | 5 | 5 | 5
ResponseMessageSize (KB) | 5 | 5 | 5
Msg Body Tracking | Y | Y | Y
XLANG MaximumReceiveInterval (ms) | 50 | 50 | 50
Isolated Host MaximumReceiveInterval (ms) | 50 | 50 | 50
BTSSvc3.0 MessagingLMBufferCacheSize | 50 | 50 | 50
BTSSvc3.0 RequestQueueSize | 256 | 256 | 256
BTSSvc{guid} MessagingThreadPoolSize | 20 | 20 | 20
BTSSvc{guid} HTTPOutMaxConnection | 48 | 48 | 48
BTSSvc{guid} HTTPOutInflightSize | 300 | 300 | 300
MaxWorkerThreads | 100 | 100 | 100
MaxIOThreads | 100 | 100 | 100
HTTPOutCompleteSize | 1 | 1 | 1

*30 virtual users sending messages at a random frequency between 100 ms and 500 ms (waiting for a previous iteration to end, then waiting for a random time, then sending another message).

**40 virtual users sending messages at a fixed 500 ms delay following previous iteration (waiting for a previous iteration to end, then waiting for 500 ms, then sending another message).

Test Results

The results of test case 16 are shown in the following tables.

Table 63 Throughput and Latency

Config | Throughput* (Mean / Median) | Request Time (Mean / Median) | Response Time (Mean / Median) | Roundtrip Time (Mean / Median) | # of roundtrips
A | 77 / 82 | 81 / 80 | 198 / 204 | 274 / 277 | 148596
B | 80 / 89 | 87 / 89 | 201 / 191 | 293 / 287 | 174913
C | 72 / 75 | 87 / 87 | 183 / 179 | 271 / 273 | 141314

*Throughput is in messages per second; all times are in milliseconds.

Table 64 Percentage of CPU

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 13 | 10 | 12 | 30 | 32 | 11 | 19 | 11 | 0.03
B | 18 | 12 | 11 | 33 | 35 | 16 | 20 | 12 | 0.06
C | 12 | 10 | 11 | 31 | 34 | 12 | 16 | 10 | 0.04

Table 65 Average memory used (MB)

Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02
A | 484 | 389 | 373 | 525 | 396 | 394 | 4057 | 2076 | 263
B | 471 | 385 | 366 | 603 | 474 | 393 | 4126 | 2076 | 263
C | 477 | 383 | 365 | 576 | 369 | 284 | 4091 | 2077 | 263

Conclusion

The results of this test case show that the dual-core Opteron produced slightly more throughput (77 msg/sec compared to 75 msg/sec in the previous test case) and slightly higher latency (274 ms compared to 267 ms in the previous test case).

The most important difference in this test case was CPU utilization: the dual-core ran at 19 percent compared to 35 percent for the single-core in the previous test. Another important observation is that there were far fewer spikes (visible in the smaller gap between the mean and median), giving a much more consistent latency.
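
The mean/median gap is a quick way to quantify those spikes: occasional slow responses pull the mean up while leaving the median almost untouched. A small illustration with made-up latency values:

```python
import statistics

# 95 steady 270 ms responses plus 5 spikes at 2000 ms
# (values are illustrative only, not the lab's measurements).
latencies_ms = [270] * 95 + [2000] * 5

mean = statistics.mean(latencies_ms)      # pulled up by the 5 spikes
median = statistics.median(latencies_ms)  # unaffected by the spikes
```

Here the median stays at 270 ms while the mean rises well above it, so a narrowing gap between the two columns in the tables above is direct evidence of fewer or smaller spikes.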

Configurations B and C show that having more users with a delay between requests produced the same effect as fewer users sending requests without any delay. The main factor is the throughput that the users generate on the system.

Therefore, the results showed that having more virtual users, as long as they send at the same overall rate, did not have adverse effects on performance.

Test Case 17 Single-Core Opteron SAN Performance Tuning

Objective: To tune the SAN performance, because it was noticed that MessageBox SQL Server disk I/O had reached certain limits and further performance could not be achieved, even though there was enough headroom in all other resources such as CPU, memory, and network bandwidth.

After discussions with the SAN experts it was suggested that the HBA card driver could be throttling the I/O because it has a parameter that determines the I/O request queue depth. Therefore, in this test case the queue depth parameter was changed from the default value of 32 to 64. This parameter is stored in the registry as:

[HKLM\SYSTEM\ControlSet001\Services\elxstor\Parameters\Device] "DriverParameter" = "QueueDepth=64"
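For reference, the same setting can be expressed as a registry (.reg) file fragment. This is a sketch only: the elxstor key applies to the Storport miniport driver named in the path above, and the driver typically must be reloaded (or the computer restarted) for the change to take effect.

```reg
Windows Registry Editor Version 5.00

; QueueDepth limits the number of outstanding I/O requests the HBA
; driver will issue (the default in these tests was 32).
[HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Services\elxstor\Parameters\Device]
"DriverParameter"="QueueDepth=64"
```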

To test the effect of the HBA card driver settings, the InProcess BatchSize setting was changed from the default of 20 to 100.

Hardware Configuration

This test is the same as the previous tests with 7 computers running BizTalk Server used as 2 Sending & Receiving, 4 Sending, and 1 Sending & Tracking. The following figure shows the hardware configuration for test case 17:

Figure 19 Hardware configuration - test case 17

Software Configuration

Using this hardware configuration, the following software configurations were tested. The configuration parameters that varied in each configuration are in bold type.

Table 66 Software configurations for test case 17

Parameter                                    A         B
Number of Virtual Users                      3         2
HTTPBatchSize                                1         1
In-Process MaximumReceiveInterval (ms)       50        50
RequestMessageSize (kb)                      5         5
ResponseMessageSize (kb)                     5         5
Msg Body Tracking                            Y         Y
XLANG MaximumReceiveInterval (ms)            50        50
Isolated Host MaximumReceiveInterval (ms)    50        50
BTSSvc3.0 MessagingLMBufferCacheSize         50        50
BTSSvc3.0 RequestQueueSize                   256       256
BTSSvc{guid} MessagingThreadPoolSize         20        20
BTSSvc{guid}HTTPOutMaxConnection             48        48
BTSSvc{guid}HTTPOutInflightSize              300       300
MaxWorkerThreads                             100       100
MaxIOThreads                                 100       100
HTTPOutCompleteSize                          1         1
InProcess BatchSize                          100       Default
HBA driver QueueDepth                        64        64

Test Results

The results of test case 17 are shown in the following tables.

Table 67 Throughput and Latency

Config   Throughput*       Request Time      Response Time     Roundtrip Time    # of roundtrips
         Mean    Median    Mean    Median    Mean    Median    Mean    Median
A        104     113       177     118       290     238       434     357       204435
B        76      87        88      87        183     180       270     267       157702

*Throughput is in messages per second; all times are in milliseconds.

Table 68 Percentage of CPU

Config   BPI4X-C02   BPI4X-O02   BPI4X-O03   BPI2X-C05   BPI2X-C06   BPI4X-A02   BPI8X-O01   BPI8X-M01   BPI8X-K02
A        37          29          24          11          20          35          57          16          0.07
B        25          19          17          12          12          25          41          12          0.03

Table 69 Average memory used (MB)

Config   BPI4X-C02   BPI4X-O02   BPI4X-O03   BPI2X-C05   BPI2X-C06   BPI4X-A02   BPI8X-O01   BPI8X-M01   BPI8X-K02
A        556         433         384         399         330         413         4060        2073        264
B        508         392         378         399         280         411         4033        2073        264

Conclusion

The results of this test case configuration A show that changing the InProcess BatchSize from the default of 20 to 100 produced more throughput (104 msg/sec compared to 79 msg/sec in test case 14 configuration A), but latency again increased as well (343 ms compared to 317 ms in test case 14 configuration A). This means that the SAN I/O was not improved by changing the HBA driver QueueDepth from the default of 32 to 64.

In configuration B, the InProcess BatchSize was reset to the default value of 20 and the results were similar to those of test case 14 configuration C, with slightly higher latency because tracking was enabled at one point in this case, while in test case 14 configuration C tracking was disabled. So in this test also, the SAN I/O was not improved by changing the HBA driver QueueDepth from the default of 32 to 64.
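The batch-size tradeoff seen in these two configurations can be sketched with a simple cost model. The per-batch and per-message costs below are hypothetical, not measured BizTalk figures; the sketch only shows why larger dequeue batches raise throughput (overhead is amortized) while also raising latency (messages wait for the batch).

```python
# Hypothetical cost model for batched dequeuing (illustrative constants).
PER_BATCH_OVERHEAD_MS = 40.0   # fixed cost to fetch and commit one batch (assumed)
PER_MESSAGE_COST_MS = 2.0      # incremental cost per message in the batch (assumed)

def batch_metrics(batch_size):
    """Return (throughput in msg/sec, mean added wait in ms) for a batch size."""
    batch_time_ms = PER_BATCH_OVERHEAD_MS + PER_MESSAGE_COST_MS * batch_size
    throughput = 1000.0 * batch_size / batch_time_ms
    avg_wait_ms = batch_time_ms / 2.0  # a message waits, on average, half a batch
    return throughput, avg_wait_ms

for size in (20, 100):  # the default and the tuned value from this test case
    tp, wait = batch_metrics(size)
    print(f"batch={size:3d}: {tp:6.1f} msg/sec, ~{wait:5.1f} ms added wait")
```

Under these assumed constants, batch size 20 yields 250 msg/sec with ~40 ms added wait, while batch size 100 yields ~417 msg/sec with ~120 ms added wait: more throughput, higher latency, matching the direction of the measured results.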

Test Case 18 Single-Core Opteron SAN I/O Testing

Objective: To work around the SAN I/O performance limit by testing the following effects.

  • Disabling the MessageBox SQL agent jobs.
  • Splitting the MessageBox database data file filegroup into multiple files on different drives.
  • Changing the HBA driver QueueDepth to the maximum value.
  • Changing the HBA driver QueueDepth to the maximum value while increasing the number of virtual users and the values of the HTTPOutMaxConnection and HTTPOutInflightSize parameters, as suggested by the product team.
Hardware Configuration

In this test, the 7 computers running BizTalk Server were used as 2 Receiving, 4 Sending, and 1 Sending & Tracking. The following figure shows the hardware configuration for test case 18:

Figure 20 Hardware configuration - test case 18

Software Configuration

Using this hardware configuration, the following software configurations were tested. The configuration parameters that varied in each configuration are in bold type.

Table 70 Software configurations for test case 18

Parameter                                    A         B         C*        D         E
Number of Virtual Users                      2         2         2         2         3
HTTPBatchSize                                1         1         1         1         1
In-Process MaximumReceiveInterval (ms)       50        50        50        50        50
RequestMessageSize (kb)                      5         5         5         5         5
ResponseMessageSize (kb)                     5         5         5         5         5
Msg Body Tracking                            Y         Y         Y         Y         Y
XLANG MaximumReceiveInterval (ms)            50        50        50        50        50
Isolated Host MaximumReceiveInterval (ms)    50        50        50        50        50
BTSSvc3.0 MessagingLMBufferCacheSize         50        50        50        50        50
BTSSvc3.0 RequestQueueSize                   256       256       256       256       256
BTSSvc{guid} MessagingThreadPoolSize         20        20        20        20        20
BTSSvc{guid}HTTPOutMaxConnection             48        48        48        48        128
BTSSvc{guid}HTTPOutInflightSize              300       300       300       300       650
MaxWorkerThreads                             100       100       100       100       100
MaxIOThreads                                 100       100       100       100       100
HTTPOutCompleteSize                          1         1         1         1         1
InProcess BatchSize                          Default   Default   Default   Default   Default
HBA driver QueueDepth                        64        64        64        2048      2048
SQLAgent Running                             Y         N         Y         Y         Y
MsgBox DB data file multiple filegroup       N         N         Y         N         N

Test Results

The results of test case 18 are shown in the following tables.

Table 71 Throughput and Latency

Config   Throughput*       Request Time      Response Time     Roundtrip Time    # of roundtrips
         Mean    Median    Mean    Median    Mean    Median    Mean    Median
A        72      88        96      88        208     200       302     295       160263
B        75      87        93      92        217     203       319     292       158060
C        67      88        115     89        281     119       377     295       158301
D        77      89        91      88        196     192       288     284       159891
E        109     110       276     119       480     247       764     372       198862

*Throughput is in messages per second; all times are in milliseconds.

Table 72 Percentage of CPU

Config   BPI4X-C02   BPI4X-O02   BPI4X-O03   BPI2X-C05   BPI2X-C06   BPI4X-A02   BPI8X-O01   BPI8X-M01   BPI8X-K02
A        22          21          14          12          12          26          46          12          0.03
B        29          20          14          12          12          25          43          12          0.04
C        24          20          14          12          11          26          41          12          0.04
D        21          22          15          12          11          27          40          13          0.05
E        35          28          23          10          20          37          53          16          0.03

Table 73 Average memory used (MB)

Config   BPI4X-C02   BPI4X-O02   BPI4X-O03   BPI2X-C05   BPI2X-C06   BPI4X-A02   BPI8X-O01   BPI8X-M01   BPI8X-K02
A        529         401         377         458         299         414         4044        2073        265
B        570         546         414         456         300         425         4082        2073        265
C        518         396         375         430         300         448         3735        2073        265
D        512         400         379         436         270         417         2604        2073        265
E        527         425         383         426         318         415         4048        2073        266

Conclusion

Configuration A serves as the baseline for this test case; the other configurations are compared against it to show their effect.

The results of configuration B compared to configuration A show that disabling the MessageBox SQL agent jobs had little effect: throughput increased slightly and, accordingly, latency also increased slightly.

The results of configuration C (with multiple files in the MessageBox data file filegroup) show that throughput decreased and latency increased compared to configuration A which means that the impact of multiple files in the MessageBox data file filegroup was negative.

The results of configuration D (with the HBA driver QueueDepth set to the maximum value) show improvement compared to configuration A (throughput increased from 72 msg/sec to 77 msg/sec and latency decreased from 302 ms to 288 ms) which proved that the SAN I/O was the bottleneck in configuration A.

The results of configuration E (with the HBA driver QueueDepth set to the maximum, more virtual users, and higher values for the HTTPOutMaxConnection and HTTPOutInflightSize parameters, as suggested by the product team to push the SAN I/O further) show that throughput increased but latency also increased. These settings did not improve the SAN I/O under the higher throughput.

Test Case 19 Single-Core Opteron Multiple MessageBox databases

Objective: To work around the SAN I/O performance limit by testing the effect of distributing the MessageBox load across multiple databases with multiple HBA cards, and then across multiple computers. The idea was to split the MessageBox database I/O on the SAN across multiple HBA cards to reduce the contention on a single HBA card, as it was believed that the SAN itself was still capable of more I/O throughput without increased latency.

Hardware Configuration

In this test, the 7 computers running BizTalk Server were used as 2 Receiving, 4 Sending, and 1 Sending & Tracking. The following figure shows the hardware configuration for test case 19:

Figure 21 Hardware configuration - test case 19

Software Configuration

Using this hardware configuration, the following software configurations were tested. The configuration parameters that varied in each configuration are in bold type.

Table 74 Software configurations for test case 19

Parameter                                         A         B         C*        D         E
Number of Virtual Users                           2         2         2         2         3
HTTPBatchSize                                     1         1         1         1         1
In-Process MaximumReceiveInterval (ms)            50        50        50        50        50
RequestMessageSize (kb)                           5         5         5         5         5
ResponseMessageSize (kb)                          5         5         5         5         5
Msg Body Tracking                                 Y         Y         Y         Y         Y
XLANG MaximumReceiveInterval (ms)                 50        50        50        50        50
Isolated Host MaximumReceiveInterval (ms)         50        50        50        50        50
BTSSvc3.0 MessagingLMBufferCacheSize              50        50        50        50        50
BTSSvc3.0 RequestQueueSize                        256       256       256       256       256
BTSSvc{guid} MessagingThreadPoolSize              20        20        20        20        20
BTSSvc{guid}HTTPOutMaxConnection                  48        48        48        48        48
BTSSvc{guid}HTTPOutInflightSize                   300       300       300       300       300
MaxWorkerThreads                                  100       100       100       100       100
MaxIOThreads                                      100       100       100       100       100
HTTPOutCompleteSize                               1         1         1         1         1
InProcess BatchSize                               20        20        20        20        20
HBA driver QueueDepth                             2048      2048      2048      2048      2048
No. of HBA cards on the Master MsgBox computer    1         2         2         2         2
No. of MsgBox databases on the Master MsgBox computer   3   3         2         2         2
No. of MsgBox databases on the DTA computer       0         0         1         1         2
No. of MsgBox databases on the 3rd SQL Server computer  0   0         0         1         1
No. of Total MsgBox databases                     3         3         3         4         5
Master MsgBox Publication enabled                 N         N         N         N         N

Test Results

The results of test case 19 are shown in the following tables.

Table 75 Throughput and Latency

Config   Throughput*       Request Time      Response Time     Roundtrip Time    # of roundtrips
         Mean    Median    Mean    Median    Mean    Median    Mean    Median
A        74      74        94      93        204     200       302     298       134368
B        73      72        96      95        208     206       306     302       132213
C        58      58        103     103       202     196       302     292       119002
D        66      66        101     101       211     207       315     307       119742
E        68      68        98      96        205     210       304     307       123499

*Throughput is in messages per second; all times are in milliseconds.

Table 76 Percentage of CPU

Config   BPI4X-C02   BPI4X-O02   BPI4X-O03   BPI2X-C05   BPI2X-C06   BPI4X-A02   BPI8X-O01   BPI8X-M01   BPI8X-K02
A        20          16          15          17          8           22          67          11          0.04
B        20          16          15          16          8           22          64          11          0.03
C        19          14          13          5           13          20          62          29          0.07
D        20          14          13          13          4           20          59          23          14.00
E        19          15          14          5           14          20          62          21          11.00

Table 77 Average memory used (MB)

Config   BPI4X-C02   BPI4X-O02   BPI4X-O03   BPI2X-C05   BPI2X-C06   BPI4X-A02   BPI8X-O01   BPI8X-M01   BPI8X-K02
A        541         395         382         443         302         406         4096        2074        266
B        536         394         382         444         271         398         4023        2044        273
C        542         395         388         425         321         402         4090        2064        273
D        546         391         379         448         303         399         4055        2065        289
E        511         392         379         397         323         402         4073        2070        1538

Conclusion

The results of configuration A show that throughput decreased (74 msg/sec compared to 77 msg/sec in test case 18 configuration D) and latency increased (302 ms compared to 288 ms in test case 18 configuration D). This means that the gain (if there was any) from splitting the load into multiple MessageBox databases was offset by the extra overhead of the DTC service. (This is needed with multiple MessageBox database configurations because processing messages involves more than one database: the master MessageBox database for updating subscriptions and the non-master MessageBox database for storing and retrieving the messages themselves.)

In configuration B, one more HBA card was added to the MessageBox computer, and the 3 MessageBox databases' data and log files were moved onto 3 different drives split between the 2 HBA cards (the master MsgBox database plus 1 non-master MsgBox database on 2 drives attached to one HBA card, and the third MsgBox database on another drive attached to the second HBA card). The results of this configuration were nearly identical to configuration A, which means that splitting the I/O across multiple HBA cards did not improve performance. Again, this proved that the SAN I/O could not be improved.

In configuration C, the non-master MsgBox database on the drive attached to the HBA card shared with the master MsgBox database's drive was moved to the DTA SQL computer to further split the SAN I/O. Also, the number of virtual users was increased from 2 to 3 to increase the load on the system. The results of this configuration show that throughput decreased (because of the extra overhead of the DTC service, this time across computers) and latency did not improve. This means that the gain from scaling out and distributing the MsgBox load was offset by the DTC overhead, and most likely the SAN I/O could not be improved even this way.

In configuration D, one more MsgBox database was added on the third SQL Server computer (for a total of 4 MsgBox databases) to further scale out the MsgBox load and further distribute the SAN I/O across multiple channels or HBA cards. The results of this configuration show a little improvement in throughput compared to the previous configuration C, with similar latency.

In configuration E, one more MsgBox database was added to the DTA SQL computer, increasing the total number of MsgBox databases to 5 to further scale out the MsgBox I/O load. The results of this configuration show a little more improvement over the previous configuration D in terms of throughput and latency, but the performance of the single MsgBox in test case 18 configuration D (77 msg/sec throughput and 288 ms latency) is still better than this configuration (68 msg/sec throughput and 304 ms latency).

Test Case 20 Single-Core Opteron SAN Performance Tuning

Objective: Because it was evident that the SAN performance was the bottleneck, the objective of this test case was to try different SAN HBA card driver settings suggested by the SAN experts to improve or work around the I/O performance limits.

Hardware Configuration

This test is the same as the previous test case 14 where 7 computers running BizTalk Server were used as 2 Sending & Receiving, 4 Sending, and 1 Sending & Tracking. The following figure shows the hardware configuration for test case 20:

Figure 22 Hardware configuration - test case 20

Software Configuration

Using this hardware configuration, the following software configurations were tested. The configuration parameters that varied in each configuration are in bold type.

Table 78 Software configurations for test case 20

Parameter                                    A         B         C
Number of Virtual Users                      3         4         4
HTTPBatchSize                                1         1         1
In-Process MaximumReceiveInterval (ms)       50        50        50
RequestMessageSize (kb)                      5         5         5
ResponseMessageSize (kb)                     5         5         5
Msg Body Tracking                            Y         Y         N*
XLANG MaximumReceiveInterval (ms)            50        50        50
Isolated Host MaximumReceiveInterval (ms)    50        50        50
BTSSvc3.0 MessagingLMBufferCacheSize         50        50        50
BTSSvc3.0 RequestQueueSize                   256       256       256
BTSSvc{guid} MessagingThreadPoolSize         20        20        20
BTSSvc{guid}HTTPOutMaxConnection             48        48        48
BTSSvc{guid}HTTPOutInflightSize              300       300       300
MaxWorkerThreads                             100       100       100
MaxIOThreads                                 100       100       100
HTTPOutCompleteSize                          1         1         1
InProcess BatchSize                          50        50        60
HBA driver QueueDepth                        128       128       128
HBA driver QueueTarget                       1         1         1
HBA driver NumFCPContacts                    2048      2048      2048
HBA driver CoalesceMSCount                   0         0         0

*Global tracking turned off.

Test Results

The results of test case 20 are shown in the following tables.

Table 79 Throughput and Latency

Config   Throughput*       Request Time      Response Time     Roundtrip Time    # of roundtrips
         Mean    Median    Mean    Median    Mean    Median    Mean    Median
A        95      96        95      91        206     203       307     307       173463
B        109     109       108     108       235     235       349     349       198740
C        126     126       116     223       239     235       357     356       229717

*Throughput is in messages per second; all times are in milliseconds.

Table 80 Percentage of CPU

Config   BPI4X-C02   BPI4X-O02   BPI4X-O03   BPI2X-C05   BPI2X-C06   BPI4X-A02   BPI8X-O01   BPI8X-M01   BPI8X-K02
A        18          12          16          35          50          16          n/a         14          0.14
B        22          16          17          54          57          21          n/a         16          0.16
C        19          30          16          44          42          24          n/a         n/a         0.14

Table 81 Average memory used (MB)

Config   BPI4X-C02   BPI4X-O02   BPI4X-O03   BPI2X-C05   BPI2X-C06   BPI4X-A02   BPI8X-O01   BPI8X-M01   BPI8X-K02
A        544         406         405         570         452         413         n/a         2089        1951
B        552         432         394         615         640         418         n/a         2090        1951
C        424         562         391         614         610         431         n/a         2090        1952

Conclusion

The results of configuration A show improvement: throughput increased and latency decreased (95 msg/sec throughput and 307 ms latency compared to test case 14 configuration A's 79 msg/sec throughput and 317 ms latency). These were considered the best results achieved in this testing.

In configuration B, the number of virtual users was increased to 4 and the results show that throughput increased but latency also increased, which means no further improvement was achieved.

In configuration C, global tracking was turned off and the results show that throughput increased, but again latency increased as well, which means no further improvement was achieved because the SAN had reached its limit.

Test Case 21 Opteron Dual Core Multiple MessageBox

Objective: Because the SAN performance could not be improved, the objective of this test was to scale out the MessageBox database into 3 databases on one computer (to keep DTC overhead to a minimum) with 3 HBA cards (one for each MsgBox database), this time using the dual-core Opteron because it had much more CPU headroom (almost double the single-core Opteron's CPU capacity; with 1 MsgBox, the single-core was at approximately 40 percent CPU utilization while the dual-core was at approximately 20 percent in comparable tests).

Hardware Configuration

This test is the same as the previous tests where 7 computers running BizTalk Server were used as 2 Sending & Receiving, 1 Sending, and 3 Sending & Tracking. The following figure shows the hardware configuration for test case 21:

Figure 23 Hardware configuration - test case 21

Note
Three tracking host instances were configured for the three-MessageBox configuration because it is recommended to have one tracking host instance per message box; BizTalk Server tries to distribute the assignment of message boxes evenly across the tracking host instances so that tracking performance does not degrade.
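The even distribution described in the note can be pictured as a round-robin assignment of message boxes to tracking host instances. This is an illustrative model only, not BizTalk Server's actual assignment algorithm, and the box and host names are hypothetical.

```python
def assign_tracking_hosts(msgboxes, hosts):
    """Round-robin msgbox -> tracking-host assignment (illustrative model)."""
    return {box: hosts[i % len(hosts)] for i, box in enumerate(msgboxes)}

# With one tracking host instance per message box (as recommended),
# each host ends up tracking exactly one box.
boxes = ["MsgBox1", "MsgBox2", "MsgBox3"]
hosts = ["TrackingHostA", "TrackingHostB", "TrackingHostC"]
assignment = assign_tracking_hosts(boxes, hosts)
print(assignment)
```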

Software Configuration

Using this hardware configuration, the following software configurations were tested. The configuration parameters that varied in each configuration are in bold type.

Table 82 Software configurations for test case 21

Parameter                                    A         B         C*        D
Number of Virtual Users                      4         30*       3         4
HTTPBatchSize                                1         1         1         1
In-Process MaximumReceiveInterval (ms)       50        50        50        50
RequestMessageSize (kb)                      5         5         5         5
ResponseMessageSize (kb)                     5         5         5         5
Msg Body Tracking                            Y         Y         N         N
XLANG MaximumReceiveInterval (ms)            50        50        50        50
Isolated Host MaximumReceiveInterval (ms)    50        50        50        50
BTSSvc3.0 MessagingLMBufferCacheSize         50        50        50        50
BTSSvc3.0 RequestQueueSize                   256       256       256       256
BTSSvc{guid} MessagingThreadPoolSize         20        20        20        20
BTSSvc{guid}HTTPOutMaxConnection             48        48        48        48
BTSSvc{guid}HTTPOutInflightSize              300       300       300       300
MaxWorkerThreads                             100       100       100       100
MaxIOThreads                                 100       100       100       100
HTTPOutCompleteSize                          1         1         1         1
InProcess BatchSize                          Default   Default   Default   Default
HBA driver QueueDepth                        Default   Default   Default   Default
HBA driver QueueTarget                       Default   Default   Default   Default
HBA driver NumFCPContacts                    Default   Default   Default   Default
HBA driver CoalesceMSCount                   Default   Default   Default   Default
# MessageBox Databases                       3         3         3         3

*30 virtual users with 0.5 sec delay between requests.

Test Results

The results of test case 21 are shown in the following tables.

Table 83 Throughput and Latency

Config   Throughput*       Request Time      Response Time     Roundtrip Time    # of roundtrips
         Mean    Median    Mean    Median    Mean    Median    Mean    Median
A        94      95        101     96        226     214       328     311       171912
B        100     104       135     131       236     235       380     378       196042
C        82      82        84      81        179     179       261     261       194248
D        98      98        90      85        194     195       288     290       177098

*Throughput is in messages per second; all times are in milliseconds.

Table 84 Percentage of CPU

Config   BPI4X-C02   BPI4X-O02   BPI4X-O03   BPI2X-C05   BPI2X-C06   BPI4X-A02   BPI8X-O01   BPI8X-M01   BPI8X-K02
A        17          15          n/a         50          51          19          n/a         14          0.02
B        18          18          17          59          59          22          44          15          0.03
C        35          13          12          62          54          36          45          29          0.58
D        37          30          28          68          68          40          50          32          0.58

Table 85 Average memory used (MB)

Config   BPI4X-C02   BPI4X-O02   BPI4X-O03   BPI2X-C05   BPI2X-C06   BPI4X-A02   BPI8X-O01   BPI8X-M01   BPI8X-K02
A        482         499         n/a         590         352         400         n/a         2075        285
B        531         549         533         635         406         424         4232        2075        285
C        587         500         473         587         370         399         4221        2075        285
D        507         526         506         597         391         412         4087        2075        285

Conclusion

The results of configuration A show no major improvement (94 msg/sec throughput and 328 ms latency compared to test case 20 configuration B of 109 msg/sec throughput and 349 ms latency).

In configuration B, compared to configuration A, there is also no improvement as more throughput also produced higher latency.

In configuration C, the results show that because the number of virtual users was decreased and tracking was turned off, the throughput decreased and latency also decreased. Compared to configuration B this is expected, but compared to test case 14 configuration C there is some improvement (82 msg/sec throughput and 261 ms latency compared to test case 14 configuration C of 75 msg/sec throughput and 251 ms latency).

In configuration D, the number of the virtual users was increased and the results show much more throughput and slightly higher latency compared to configuration C (98 msg/sec throughput and 288 ms latency compared to configuration C of 82 msg/sec throughput and 261 ms latency).

The following sections give a summary of the performance test results, which includes highlights on throughput and latency, BizTalk Server performance, database and SQL Server performance, and network performance.

Throughput and Latency

The results show that in each configuration, decreasing the throughput (by decreasing the number of virtual users sending messages to BizTalk Server) improves latency. This relationship is clearly shown by the results of test case 6, in which 5, 7, and 8 virtual users were tested. The following figure shows the results:

Figure 24 Virtual users vs. throughput and latency


The relationship between throughput and latency is due to resource contention, mainly in the MessageBox database. For a given configuration with fixed resource limits (especially CPU speed and disk I/O speed), higher throughput requires faster processing to keep the same message latency. If the MessageBox database computer's CPU reaches its limits (or even just high utilization, for example, greater than 60 percent), or the SAN I/O starts to degrade (because more I/O is requested than it can handle and an I/O queue starts to build up), then it takes more time to process and send each message, and latency increases.
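The virtual-user/throughput/latency relationship above can be sketched with Little's law for a closed system (throughput = users / response time) combined with a hypothetical service model in which response time inflates as the shared MessageBox approaches saturation. The constants below are illustrative, not measured BizTalk figures.

```python
BASE_RT_MS = 50.0         # response time (ms) at negligible load (assumed)
CAPACITY_MSG_SEC = 200.0  # saturation throughput of the shared resource (assumed)

def steady_state(users):
    """Solve X = users / R(X), where R(X) = BASE / (1 - X / CAPACITY).

    Closed form: X * BASE / (1 - X/C) = 1000 * users
              => X = 1000*users / (BASE + 1000*users / C)
    """
    x = 1000.0 * users / (BASE_RT_MS + 1000.0 * users / CAPACITY_MSG_SEC)
    r = BASE_RT_MS / (1.0 - x / CAPACITY_MSG_SEC)
    return x, r  # (msg/sec, ms)

for n in (5, 10, 20, 40):
    x, r = steady_state(n)
    print(f"{n:2d} users -> {x:6.1f} msg/sec at {r:6.1f} ms roundtrip")
```

Doubling the users never doubles throughput: as throughput approaches the assumed capacity, each extra user is converted mostly into additional latency, which is the pattern the test cases repeatedly observed.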

Computers Running BizTalk Server

To prevent resource contention, isolate the tracking service and transport adapters onto separate hosts.

Although the computers running BizTalk Server can be made fast enough for receiving and sending messages by scaling up (using computers with more, faster CPUs), the tests show that scaling out (adding as many computers as needed) was more important. This can be explained as an effect of parallelism.

The best results were achieved with 7 computers running BizTalk Server (2 configured for receiving and sending messages and 5 configured for sending messages only).

The other important factor was the balance between the receiving servers and the sending servers. In BizTalk Server 2004, receiving messages is faster and less expensive than processing and sending them.

Tests showed that with more than 2 computers running BizTalk Server receiving and fewer than 7 computers running BizTalk Server processing and sending, a backlog of messages waiting to be processed started to accumulate, causing a significant increase in message latency. In extreme cases, the message queue in the BizTalk Message Box database increased significantly, causing processing delays of 20 seconds or more.
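The backlog effect described above follows directly from the rate imbalance: whenever messages are published to the MessageBox faster than the sending hosts drain them, the spool depth (and hence queueing latency) grows without bound. The rates below are hypothetical, not measured BizTalk figures.

```python
def spool_depth(receive_rate, send_rate, seconds):
    """Messages accumulated in the spool after `seconds` of sustained load."""
    return max(0.0, float(receive_rate - send_rate) * seconds)

# Balanced tiers: senders keep up, no backlog accumulates.
print(spool_depth(receive_rate=100, send_rate=100, seconds=60))  # 0.0

# Undersized send tier: a 10 msg/sec surplus builds a 600-message backlog
# every minute, and every queued message adds its wait to the roundtrip time.
print(spool_depth(receive_rate=100, send_rate=90, seconds=60))   # 600.0
```

This is why adding sending servers (up to the 2-receiving/5-sending balance found above) reduced latency far more than adding receive capacity.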

The following figure shows how throughput and latency improved by increasing the number of sending servers from 4 to 7 with 2 receiving servers.

Figure 25 Receiving and sending vs. throughput and latency


Monitoring the computers running BizTalk Server during testing revealed that the CPU utilization of these computers was low. On the 4 CPU computers, the average CPU utilization was around 5 percent, while the 2 CPU computers were between 30 percent and 40 percent, especially when they were used to host the receiving and sending hosts.

Although 4-CPU servers were used in the tests, 2-CPU computers with fast CPUs are sufficient for the computers running BizTalk Server in this scenario; the 4-CPU servers were underutilized even during tests with high throughput and low latency.

Database and SQL Server Performance

SQL Server performance is the most important factor for the overall performance of BizTalk Server 2004. When using BizTalk Server 2004 for messaging only (that is, where no other features such as orchestration, Human Workflow, or Business Rules Engine are used), the only BizTalk Server databases used are:

  • BizTalkMgmtDb: Used to store the configuration settings for the BizTalk group
  • SSODB: Used to store the sensitive configuration settings encrypted for the BizTalk group
  • BizTalkDTADb: Used to store the tracking data for the BizTalk group
  • BizTalkMsgBoxDb: Used to store the received messages until they are processed and sent. It contains a spool table where messages are first stored when they are received before they are processed and moved into the right application queue tables. It also stores the state of the system and message subscriptions data.

Among these databases, the BizTalkMsgBoxDb database is the busiest database at runtime and therefore the performance of the SQL server running this database is very critical to the overall performance of the BizTalk group. Each BizTalk group has at least one instance of this database and more instances can be added to scale it out.

In this testing, a separate computer was dedicated to the MessageBox database and another computer was dedicated to the other BizTalk Server databases. The third computer running SQL Server was used for custom functionality in this scenario, such as error logging from the custom pipeline components and the audit trail.

For the MessageBox database server, the most powerful server computer should be used to achieve the best performance. In this performance tuning testing, the following computers were tested for the MessageBox database:

  • Intel Xeon 8-way (HT) 3.0 GHz 12 GB RAM running the 32-bit version of the Microsoft Windows Server™ 2003 operating system and the 32-bit version of SQL Server 2000 with Service Pack 3a (SP3a)
  • AMD Opteron 4-way single-core, 2.4 GHz 16 GB RAM running the 64-bit Windows Server 2003 with SP1 and the 32-bit version of SQL Server 2000 with SP4
  • AMD Opteron 4-way dual-core, 2.2 GHz 16 GB RAM running the 64-bit version of Windows Server 2003 with SP1 and the 32-bit version of SQL Server 2000 with SP4

For the tracking database and other databases:

  • Intel Xeon 8-way (HT) 3.0 GHz 12 GB RAM running the 32-bit version of Windows Server 2003 and the 32 bit version of SQL Server 2000 with SP3a

The SQL network connectivity used was the default TCP/IP. The SQL memory setting for the Xeon computer was also the default setting (which is up to 2 GB) and for the Opteron computer it was fixed to 4 GB.

The disk performance (I/O speed and volume) of those computers was very critical. SAN storage was used, and it was ultimately the bottleneck. Better performance could have been achieved if the SAN had been able to perform more I/O, faster.

32-bit 8-Way Xeon versus 64-bit 4-Way Opteron

The following tables compare MsgBox database server performance between the 32-bit 8-Way Xeon and the 64-bit 4-Way Opteron computers from test cases 8 and 9.

Table 86 Throughput and Latency

Config   Throughput*       Request Time      Response Time     Roundtrip Time    # of roundtrips
         Mean    Median    Mean    Median    Mean    Median    Mean    Median
8-A      71      76        252     243       398     392       642     644       140111
9-A      128     123       979     329       1045    580       1912    905       245852

*Throughput is in messages per second; all times are in milliseconds.

Table 87 Percentage of CPU

Config   BPI4X-C02   BPI4X-O02   BPI4X-O03   BPI2X-C05   BPI2X-C06   BPI4X-A02   BPI8X-O01   BPI8X-M01   BPI8X-K02
8-A      30          19          21          38          16          25          47          17          0.07
9-A      45          45          47          22          16          73          73          20          0.02

Table 88 Average memory used (MB)

Config   BPI4X-C02   BPI4X-O02   BPI4X-O03   BPI2X-C05   BPI2X-C06   BPI4X-A02   BPI8X-O01   BPI8X-M01   BPI8X-K02
8-A      350         310         285         438         283         324         1987        2017        351
9-A      570         515         542         446         325         513         3508        2049        260

The results above show the following:

  • Moving from the 32-bit 8-Way Xeon to the 64-bit 4-Way Opteron increased throughput from 71 msg/s to 128 msg/s (with the same number of virtual users) and latency increased from 642 ms to 1912 ms, if you compare the mean roundtrip times, and from 644 ms to 905 ms, if you compare the median roundtrip times.
  • CPU utilization increased from 47 percent for the 32-bit 8-Way Xeon to 73 percent for the 64-bit 4-Way Opteron.
  • Memory used by the 32-bit 8-Way Xeon was approximately 2 GB while the 64-bit 4-Way Opteron used approximately 3.5 GB.

This means that higher throughput causes higher latency. The 64-bit 4-Way Opteron allowed for more messages to be received, processed, and inserted into the system, allowing the same number of virtual users to insert more work. This created more contention on the message box which causes higher latency times.

Aa475435.note(en-US,BTS.10).gifNote
The physical memory used by the 32-bit 8-Way Xeon SQL Server was the default maximum value for applications on a 32-bit Microsoft Windows® operating system, which is 2 GB. For the 64-bit 4-Way Opteron SQL Server, the behavior of SQL memory management was changed to be fixed at 4 GB.

Number of Message Boxes

As the most contended resource within the architecture, the single BizTalk MessageBox database is an obvious performance bottleneck.

Each BizTalk group has at least one instance of this database, and more instances can be added to scale it out. This scalability feature has two important factors to consider:

  • One instance of the MessageBox database (called the master MessageBox) is always used, even when messages are stored in and retrieved from the other, non-master MessageBox databases, because subscription processing must be done in the master database.
  • When additional MessageBox databases are used to distribute the load of storing and retrieving messages, the master MessageBox database is still needed for subscription processing. Each message is therefore processed within a transaction that spans multiple databases (the master MessageBox database and the additional MessageBox database used to store and retrieve the message). This distributed transaction between the database servers is coordinated by the Microsoft Distributed Transaction Coordinator (MSDTC) service, which adds overhead to the message processing transaction. This overhead is higher when the additional MessageBox databases are on a different physical computer than the master MessageBox.

It is also important to note that, although other benchmarks have shown that a multiple-MessageBox configuration allows for more throughput, this exercise was aimed at achieving low latency, which limited the ability to take advantage of the extra bandwidth that multiple MessageBoxes provide.

In test case 19, multiple-MessageBox configurations were tested. The results of configuration A, with 3 MessageBox databases, show that throughput decreased from 77 msg/sec to 74 msg/sec compared to test case 18 configuration D (a single MessageBox database), while latency increased from 288 ms to 302 ms. Any gain from splitting the load across 3 MessageBox databases was therefore offset by the extra overhead of the DTC service that multiple MessageBox databases require. Although overall performance did not improve significantly, this multiple-MessageBox configuration produced more consistent results (fewer latency spikes) during the test, as can be seen by comparing the mean and median latency numbers.
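The mean/median comparison above works as a simple spike detector. A minimal Python sketch (using made-up latency samples, not numbers from these tests) shows why a few large spikes pull the mean up while leaving the median almost unchanged:

```python
from statistics import mean, median

# A steady latency profile: 100 samples around 290 ms.
steady = [290] * 100

# The same profile with 5 checkpoint-style spikes of 2000 ms mixed in.
spiky = [290] * 95 + [2000] * 5

# With no spikes, mean and median agree (290 ms).
print(mean(steady), median(steady))

# With spikes, the median stays at 290 ms while the mean is pulled
# well above it, which is how spiky behavior shows up in the tables.
print(mean(spiky), median(spiky))
```

When the mean and median of a test run are close, as in the multiple-MessageBox configuration above, latency was steady for the duration of the run.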

In configuration D with 4 MessageBox databases on different computers running SQL Server, the results show little improvement in throughput although with similar latency.

In configuration E, with 5 MessageBox databases on different computers running SQL Server, the results show slightly more improvement over configuration D in both throughput and latency. Even so, the single-MessageBox result from test case 18 configuration D (77 msg/sec throughput and 288 ms latency) remains better than this configuration (68 msg/sec throughput and 304 ms latency).

The following figure shows the number of MessageBox databases versus throughput and latency:

Figure 26 MessageBox vs. throughput and latency


In order to assess any performance benefit from the dual-core Opteron processor (which has more CPU headroom than the single-core CPU), a 3-MessageBox-database configuration was tested in test case 21. Additionally, each MessageBox database was serviced by a separate HBA card to spread the load on the SAN I/O. The results of configuration C in this test case (82 msg/sec throughput and 261 ms latency) show some improvement over test case 14 configuration C (75 msg/sec throughput and 251 ms latency) with a single MessageBox.

In configuration D, the number of virtual users was increased. The results show higher throughput with slightly higher latency than configuration C (98 msg/sec and 288 ms latency versus 82 msg/sec and 261 ms latency). Weighing the gain in throughput against the degradation in latency, this can be considered better performance than test case 14 configuration C (75 msg/sec and 251 ms latency) with a single MessageBox.

Note also that during the test of this configuration, high service times were observed on the SAN I/O, which capped the performance.

Message Body Tracking

The results of the test cases 14 and 15 show the impact of message body tracking on performance.

  1. Test Case 14 – Configuration A: Message body tracking enabled at 4 points (the most expensive as explained in the Tracking section under the Test Description at the beginning of this document).
  2. Test Case 14 – Configuration B: Message body tracking disabled.
  3. Test Case 15 – Configuration A: Message body tracking enabled at 1 point only.

The following tables show the results of three comparable configurations:

Table 89 Throughput and Latency

| Config | Throughput* (mean) | Throughput* (median) | Request time (mean) | Request time (median) | Response time (mean) | Response time (median) | Roundtrip time (mean) | Roundtrip time (median) | # of roundtrips |
|---|---|---|---|---|---|---|---|---|---|
| 14-A | 79 | 83 | 104 | 100 | 211 | 210 | 317 | 312 | 152218 |
| 14-B | 92 | 98 | 92 | 89 | 209 | 210 | 301 | 303 | 177730 |
| 15-A | 87 | 92 | 99 | 100 | 226 | 230 | 333 | 227 | 166629 |

*Throughput is in messages per second; all times are in milliseconds.

Table 90 Percentage of CPU

| Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02 |
|---|---|---|---|---|---|---|---|---|---|
| 14-A | 15 | 13 | 10 | 41 | 32 | 15 | 56 | 12 | 0.03 |
| 14-B | 18 | 13 | 12 | n/a | 44 | 20 | 46 | 14 | 0.04 |
| 15-A | 18 | 10 | 14 | n/a | 29 | 20 | 52 | 13 | 0.06 |

Table 91 Average memory used (MB)

| Config | BPI4X-C02 | BPI4X-O02 | BPI4X-O03 | BPI2X-C05 | BPI2X-C06 | BPI4X-A02 | BPI8X-O01 | BPI8X-M01 | BPI8X-K02 |
|---|---|---|---|---|---|---|---|---|---|
| 14-A | 491 | 372 | 373 | 511 | 426 | 394 | 4097 | 2059 | 261 |
| 14-B | 480 | 376 | 375 | n/a | 446 | 396 | 4157 | 2059 | 261 |
| 15-A | 475 | 383 | 373 | n/a | 435 | 396 | 4072 | 2059 | 263 |

The results of 14-B show higher throughput and lower latency than 14-A, because message body tracking was enabled in 14-A and disabled in 14-B.

The results of 15-A show higher throughput than 14-A but also higher latency. Message body tracking at 1 point in 15-A was less expensive than at 4 points in 14-A, which allowed throughput to increase, and that increase in turn raised latency.

As expected, comparing 15-A (message body tracking at 1 point) to 14-B (no message body tracking) shows that message body tracking at even a single point still has a negative impact on performance.

It is also worth noting that under a lighter load the overhead of message body tracking is not as high, although it is not negligible.

The following figure shows the results of message body tracking versus throughput and latency:

Figure 27 Message Body tracking versus throughput and latency


Disk Input/Output and SAN Performance

The SAN used in the testing was from 3PAR, configured with 134 10K-RPM disks. 3PAR employs 3-level virtualization based on a mapping methodology.

The first level of mapping virtualizes physical disk drives of any capacity into a pool of uniform-sized "chunklets" (256 MB each). These fine-grained chunklets eliminate underutilization of storage assets by permitting volumes to be sized precisely and not according to large arbitrary increments. Complete system access to every chunklet eliminates large pockets of inaccessible storage. Performance is enhanced, even for small volumes, since the underlying chunklets are distributed across scores, or even hundreds, of disks.
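The chunklet arithmetic can be sketched as follows. The 256-MB chunklet size comes from the text above; the 10-GB volume size is an arbitrary illustration, not a figure from the tests:

```python
import math

CHUNKLET_MB = 256  # 3PAR chunklet size, per the description above

def chunklets_needed(volume_gb: float) -> int:
    """Number of 256-MB chunklets backing a volume of the given size."""
    return math.ceil(volume_gb * 1024 / CHUNKLET_MB)

# Even a small 10-GB volume maps to 40 chunklets, which the array can
# place on up to 40 different spindles of the 134-disk pool — this is
# why performance is enhanced even for small volumes.
print(chunklets_needed(10))  # -> 40
```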

The second level of mapping associates chunklets with Logical Disks (LDs). Logical Disks are intelligent compilations of chunklets based on RAID characteristics and the location of chunklets across the system. LDs are tailored to meet precise cost, performance, and availability characteristics. The first and second level mappings result in a massive parallelism of workloads across disks, Fibre Channel loops, and Controller Nodes. This load balancing occurs simply and automatically, eliminating the need for array planning or disk management.

The third level of mapping associates Virtual Volumes (VVs) with all or portions of an underlying LD or multiple LDs. VVs are the virtual capacity representations that are ultimately exported to hosts and applications. A VV can be coherently exported through as many or as few 3PAR InServ Storage Server ports as desired.

The following figure shows 3PAR storage virtualization:

Figure 28 3PAR storage virtualization


The cabinet was configured with 2 controllers, each with 8 GB RAM. Each volume was configured as RAID 1, meaning the disks are mirrored while the 3PAR performs the striping. The servers were connected through a Brocade 3800 2-Gbps Fibre Channel switch and used Emulex HBA cards to connect to the SAN.

3PAR provided monitoring tools to observe SAN behavior during testing, in particular the number of I/O operations performed on the SAN, the I/O sizes, the queue length, and the service times (the length of time in milliseconds that the SAN takes to complete I/O transactions), sampled at one-second intervals.

Using these monitoring tools, the following was observed:

  • Although the SAN I/O figures indicated that the SAN performance numbers did not reach the maximum values, meaning it was not saturated yet, there were correlations between the SAN I/O service time spikes and immediate effects on message latency.
  • These effects were especially pronounced during SQL checkpointing events under load, when SQL Server synchronizes the transaction log and the data file. During such checkpointing events a significant effect on SAN utilization was observed, and as a knock-on effect SQL Server could not process messages as quickly, resulting in higher latency.

These observations indicated that SAN I/O was the bottleneck: all other resources had headroom, yet latency rose whenever throughput increased, and the SQL checkpointing events were clearly causing spikes in message latency.

To reduce or minimize the effect of SQL checkpointing (which produces high I/O for a short period every 1 minute or so) the SAN I/O performance should be improved such that the SQL performance could "ride out" the checkpointing events without adversely affecting the message latencies.

Switching off SQL checkpointing is not an option, and varying the SQL checkpoint interval did not provide an improvement, for the following reasons:

  • Making checkpointing less frequent gave SQL Server more work to do at each checkpoint event, resulting in even larger spikes and a larger impact on the SAN I/O.
  • Making checkpointing more frequent caused SQL Server to checkpoint more often, resulting in multiple smaller spikes, each of which still affected latency adversely, although to a lesser degree.
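The trade-off in the two bullets above can be modeled with simple arithmetic: at a constant dirty-page rate, the total I/O flushed per minute is fixed, so the checkpoint interval only trades burst size against burst frequency. The 8 MB/s rate below is a made-up illustration, not a measured value:

```python
def checkpoint_spike(dirty_mb_per_sec, interval_sec):
    """Return (MB flushed per checkpoint, checkpoints per minute)
    for a constant dirty-page rate."""
    per_event = dirty_mb_per_sec * interval_sec
    events_per_min = 60 / interval_sec
    return per_event, events_per_min

# One large burst per minute vs. four smaller bursts per minute;
# either way, 480 MB/minute must reach the SAN.
print(checkpoint_spike(8, 60))  # one 480-MB burst, once a minute
print(checkpoint_spike(8, 15))  # 120-MB bursts, four times a minute
```

The model shows why neither direction helps: the SAN must absorb the same volume of checkpoint I/O per minute regardless of the interval, so only faster SAN I/O lets SQL Server "ride out" the events.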

As an indication of the SAN performance and the I/O requirements, the following table shows the numbers recorded (using the SAN tools) for the virtual volume that had the MessageBox database in the tests that were bound by the SAN I/O:

Table 92 Input/Output per second

| File | I/O per sec (cur) | I/O per sec (avg) | I/O per sec (max) | Kbytes per sec (cur) | Kbytes per sec (avg) | Kbytes per sec (max) | Svt ms (cur) | Svt ms (avg) | IOSz KB (cur) | IOSz KB (avg) |
|---|---|---|---|---|---|---|---|---|---|---|
| Data File* | 5320 | 338 | 5320 | 122830 | 7797 | 122830 | 3.4 | 3.4 | 23.1 | 23.1 |
| Log File | 999 | 745 | 1051 | 21296 | 10299 | 33934 | 0.9 | 0.8 | 21.5 | 13.7 |

The numbers above show that the service time for the data file I/O during SQL checkpoints was 3.4 ms, which is very high. The service time for I/O should be in microseconds. Even for the log file I/O, the service time of 0.9 ms is high.
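As a sanity check on Table 92, the average I/O size column can be reproduced by dividing throughput by IOPS:

```python
# Cross-check Table 92: average I/O size (KB) = throughput (KB/s) / IOPS.
data_file_iops, data_file_kbps = 5320, 122830   # data file "max" values
log_file_iops,  log_file_kbps  = 999,  21296    # log file "cur" values

print(round(data_file_kbps / data_file_iops, 1))  # ~23.1 KB, matches IOSz KB
print(round(log_file_kbps / log_file_iops, 1))    # ~21.3 KB, close to the 21.5 reported
```

The small discrepancy on the log file reflects the one-second sampling of the SAN tools rather than an error in the table.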

It is important to note that different SAN architectures require different configurations. For the type of workload characteristic that BizTalk Server uses, it is best to have LUNs with as many fast spindles (physical disks) as possible.

Some SAN architectures allocate large spaces on fewer drives, meaning that for a given LUN there are fewer disks. For these architectures, more disk space should be allocated than is needed to ensure that more spindles are used.

Network Performance

1-Gbps network connections were provided between all computers in the configuration, with the exception of the connection from the test harness computer to the computers running BizTalk Server, which was a 100-Mbps network connection.

The test harness computer (used as the load generator for LoadRunner as well as for the test harness Web applications) utilized, on average, less than 20 percent of the available 100-Mbps bandwidth.

The SQL Server computer running the MessageBox database had a 1-Gbps connection, of which approximately 120 Mbps was used under load on average, with no discernible increase as load grew. The best explanation for this is that even though the network bandwidth had more headroom, the SAN I/O limited SQL Server performance.

The network cards used were the standard Broadcom 10/100/1000 NICs that ship with HP servers. For more information, see the Hewlett-Packard Web site.

While the default settings in BizTalk Server 2004 provide optimal performance for many hardware and software configurations, consider the following tuning guidelines for low latency:

  • The tests show that given the same resources, throughput and latency are proportional. To achieve the best results you therefore need to achieve the right balance of sending and receiving load distribution.
  • To prevent resource contention, you should isolate the tracking service and transport adapters onto separate hosts.
  • While the CPUs on the computers running BizTalk Server must be fast enough to process message receiving and sending (easily achieved by scaling up the receiving and sending servers with multiple fast CPUs), the key to achieving low latency is scaling out: adding as many computers as needed and balancing the ratio of receiving computers to sending computers for parallelism.
  • The ratio of receiving computers to sending computers matters because receiving is more complex than sending (it includes, for example, subscription matching and inserting message bodies and properties). Senders can therefore "starve" if there is too much receiving, because receiving can lock resources away from senders. Having more senders is not sufficient by itself; it is also important to have fewer receivers and/or less input from receivers so that the senders can get to the messages and do their work efficiently. The best low-latency results were achieved with 2 receiving host instances and 7 sending host instances.
Note
The Messaging Isolated service class (like XLANG/s and MSMQT) is not used in this scenario (it is used only for 2-way receive ports). Therefore, the polling interval for this service class should be increased to a high value, not decreased. This lowers contention on the MessageBox database: with a higher interval, the Messaging Isolated messaging agent polls the MessageBox less frequently, leaving more room for the Messaging In-Process messaging agent to poll it.

  • The parameters used to decrease the receive latency are:

The settings of MaxReceiveInterval and BatchSize depend on whether the receive port is 2-way or 1-way. For a 1-way receive port, increase the MaxReceiveInterval value (for example, to 1000 ms) so that the adapter dequeues very slowly, producing less contention on the queue table for the send ports, which should pick up the received messages and send them as quickly as possible. BatchSize is irrelevant for 1-way receive ports.

The most important parameter in this scenario (1-way receive port) is HTTPBatchSize, which was set to 1 so that the HTTP receive adapter submits each message as soon as it receives it. With this setting, the receivers receive as fast as possible.

You consequently need to make the senders as fast as possible so that messages do not back up in the application queue. Ideally, the application queue should have almost no backlogged messages at any time. In this testing the best results were achieved with an average application queue depth of approximately 25 messages. We therefore recommend that you always monitor the number of records in the application queue table.
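Little's law gives a rough feel for what a 25-message queue depth means in latency terms. This is only an estimate: the ~95 msg/s throughput figure is borrowed from test case 20-A later in this document, and the measured roundtrip also includes receive-side time that a pure queue-depth calculation cannot capture:

```python
def time_in_queue_ms(queue_depth, throughput_msg_per_sec):
    """Little's law (N = X * R): average time a message spends queued."""
    return queue_depth / throughput_msg_per_sec * 1000

# ~25 messages queued at ~95 msg/s implies roughly 263 ms of queueing,
# the same order of magnitude as the ~300 ms roundtrip target.
print(round(time_in_queue_ms(25, 95)))  # -> 263
```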

  • Corresponding parameters are used to speed up the sending servers.
  • The number of subscriptions and their complexity can reduce throughput while latency remains the same (or lower, given the same configuration). Some quick testing (not documented in detail due to time constraints) showed that adding 40 simple subscription criteria or filters decreased throughput from 70 msg/sec to 66 msg/sec; adding 40 more dropped it further to 63 msg/sec with almost the same latency.
  • The SAN I/O performance seems to be the most difficult resource to scale. No matter how many messages you get through the system, you need to make sure that the SAN is capable of doing the required rate of I/O without degradation in the I/O response times. The tests showed that although the SAN was still capable of doing the required number of I/Os (because it could split this number of I/Os across as many disks as it had), the I/O response times were negatively impacted as throughput increased to high numbers.

To achieve low average latency and avoid spikes, the MessageBox SQL Server computer should run at between 40 and 50 percent CPU utilization, and the disks should be capable of sustaining the required volume of I/O without degradation in response times.

More than 21 test cases (a total of 50 configurations) were run to find the optimal configuration that could achieve a message latency of 300 ms with the maximum possible throughput for a messaging-only scenario using the HTTP transport in one-way receive and two-way and one-way send cycles, with XML request messages of 5 KB and XML response messages of 9 KB and 5 KB, using BizTalk Server 2004.

Test cases 1 through 14 focused on bringing the message latency down to an average of approximately 300 ms. The remaining test cases attempted to achieve better results, identify bottlenecks, and scale to the maximum possible throughput while keeping latency low. Ultimately, the bottleneck was the SAN I/O of the MessageBox database, which could not be improved or scaled further.

The following table shows the best results achieved with message body tracking:

Table 93 Best results for Message Body tracking

| Config | Throughput* (mean) | Throughput* (median) | Request time (mean) | Request time (median) | Response time (mean) | Response time (median) | Roundtrip time (mean) | Roundtrip time (median) | # of roundtrips |
|---|---|---|---|---|---|---|---|---|---|
| 21-A | 94 | 95 | 101 | 96 | 226 | 214 | 328 | 311 | 171912 |
| 20-A | 95 | 96 | 95 | 91 | 206 | 203 | 307 | 307 | 173463 |
| 19-A | 74 | 74 | 94 | 93 | 204 | 200 | 302 | 298 | 134368 |
| 18-D | 77 | 89 | 91 | 88 | 196 | 192 | 288 | 284 | 159891 |
| 17-B | 76 | 87 | 88 | 87 | 183 | 180 | 270 | 267 | 157702 |
| 16-A | 77 | 82 | 81 | 80 | 198 | 204 | 274 | 277 | 148596 |
| 16-B | 80 | 89 | 87 | 89 | 201 | 191 | 293 | 287 | 174913 |
| 15-B | 75 | 79 | 81 | 75 | 184 | 189 | 267 | 265 | 144387 |
| 14-A | 79 | 83 | 104 | 100 | 211 | 210 | 317 | 312 | 152218 |

*Throughput is in messages per second; all times are in milliseconds.

Test case 20 configuration A is considered the best result, because latency is approximately 300 ms and throughput is the highest. It is also worth noting that the mean and median roundtrip times are the same (307 ms), which means that performance was steady for the whole test.

The SAN performance was ultimately the bottleneck to achieve more throughput while maintaining the same latency.

In this configuration, 5 computers running BizTalk Server were sending and 2 computers running BizTalk Server were receiving and sending. Additionally, there were 2 SAN HBA adapters on the computer: one was used for the MessageBox log file and the second was used for the data file.

3 virtual users were used to represent a constant rate of request messages received by BizTalk Server receive servers without any delay (or think time). The following figure shows message throughput:

Figure 29 Message throughput


As the chart above shows, the reported average of 95 msg/s was accurate: the number of messages received per second fluctuated between 89 msg/s and 103 msg/s. This jagged behavior is expected, because BizTalk Server receives and processes messages in a non-linear fashion (for example, it receives messages in batches); in this case each message was processed as soon as it was received. The following figure displays the time taken to process each message:

Figure 30 Request versus response


As this chart (and indeed all of the tests) indicates, the message request time is significantly lower than the response time. The roundtrip time (the sum of the request and response times for each message) appears to mimic the response time in its dynamics; however, this is simply because LoadRunner samples all the counters at the same time.

These charts are made up of a sampling of the message times, and not all (as LoadRunner only samples the messages once every second and the chart is plotted to 6 second intervals). It is possible that some messages were higher or lower than those shown here.

List of Contributors: Angel Pallares - Scalability Test Analyst, Accenture; Andrew P. Glenister - Scalability Test Manager, Accenture
