.gif)
SQL Server
Technical Article
Writers: Len Wyatt, Tim Shea, David Powell
Published: March 2009
Applies to: SQL Server 2008
Summary: In February 2008, Microsoft announced
a record-breaking data load using Microsoft® SQL Server® Integration Services (SSIS):
1 TB of data in less than 30 minutes. That data load, using SQL Server
Integration Services, was 30% faster than the previous best time using a
commercial ETL tool. This paper outlines what it took: the software, hardware,
and configuration used. We will describe what we did to achieve that result,
and offer suggestions for how to relate these techniques to typical scenarios. Even
for customers who don't have needs quite like this benchmark, such efforts can
teach a lot about getting optimal performance.
Introduction
Businesses have ever-increasing volumes of data stored in
many heterogeneous systems. To provide meaningful, consistent, and reliable
information to their end users, these businesses rely on data integration
technologies to extract, transform, and load data (commonly known as ETL
operations) as they move data between systems. ETL tools such as Microsoft SQL
Server Integration Services (SSIS) support these types of data integration
activities. Businesses want to know that an ETL tool they choose will be able
to support any data volume they might require and at the same time allow them
to integrate data from any of their heterogeneous data sources.
To illustrate the ability of SSIS to meet such performance
needs, Microsoft and Unisys arranged to load over 1 terabyte (TB) of data that
was read from flat files on four source servers into a SQL Server database on a
single destination server. In this test, the data was read, converted from text
fields to database data types, transferred over the network, and inserted into
the destination database in less than 30 minutes. To be precise, 1.18 TB
of flat file data was loaded in 1,794 seconds. This is equivalent to
1.00 TB in 25 minutes 20 seconds or 2.36 TB per hour.
It is important to note that this is not the same as simply
doing a bulk load of the data into the database. If data is available on the
destination system, and if it does not need to be standardized or corrected for
errors before it is loaded, bulk loading data makes sense. In the more common
case where data must be moved between systems and transformed along the way,
ETL tools are needed. ETL tools like SSIS can perform functions such as moving
data between systems, reformatting data, integrity checking, key lookups, and
tracking lineage. In our 1 TB loading experiment we did not perform
extensive transformations, because we wanted to run an experiment that would be
comparable to what other ETL tool vendors have published.
The idea of being able to compare ETL tool performance is an
important one. Customers should be able to compare ETL tools in the same manner
that they can use a TPC-E or TPC-H benchmark today to compare relational
databases. There is no commonly accepted benchmark for ETL tools. Microsoft
thinks there should be. Industry-standard benchmarks can lead to healthy
competition, better products, and better publication of the techniques used to
get high performance. Microsoft supports the idea of defining an
industry-standard benchmark that reflects the real-world uses of ETL tools.
Design Overview
Conceptually, the design of this system is very simple, as
shown in Figure 1. There are 56 streams of data generated as flat files using
the TPC-H data generator, so there will be 56 instances of the SSIS package
executing in parallel. They will write to a single destination database. Each
instance of the package will read one of each of the file types created by the data
generator: Customer, LineItem, Orders, Part, PartSupp, and Supplier. The file
names to use are passed in as arguments to the package. The source files will
be described in the next section, and the SSIS package will be described in
detail in the section on SSIS package design.
.jpg)
Figure 1:
Conceptual data flow
On the output side, each package will write to a different
partition in the destination tables. More precisely, as illustrated in Figure
2, each package will write into a separate table for highest performance, and
the tables will be “switched in” to partitions of the larger table. This will
be described more fully in the section on database setup. There are a number of
times when partitioning a table is a good practice, one of them being when multiple
large insertions need to be performed concurrently–exactly the situation we
have for this scenario. Each of the major TPC-H tables (Customer, LineItem,
Orders, Part, PartSupp, Supplier) was partitioned for this exercise. The tiny
Nation and Region tables were not partitioned.
.jpg)
Figure 2: From
flat files to table partitions
The SSIS package that loaded the data was quite simple.
Figure 3 describes it in detail. The control flow of the package consisted of
six data flows, one for each of the major tables, which run sequentially. Each
data flow has a flat file source sending data to an OLE DB destination. The
flat file source contains the field definitions for its file type, and the data
types the fields should be converted to (32-bit integer, date/time types,
money, and so on). The OLE DB destination contains the mapping of columns in
the data flow to columns in the destination table in the relational database.
Note that we used an OLE DB destination, not a SQL Server destination. This
means that the package could have written data to any type of relational
database that has an OLE DB adapter, not just SQL Server, and that the
destination database does not have to be on the same machine that is running
the SSIS package. More details on the package follow in the section on SSIS
package design.
.jpg)
Figure 3: SSIS
package overview
Physically, the source files are distributed across four
source servers and the destination database is on a Unisys ES7000 server. As
shown in Figure 4, the SSIS package instances execute on the source servers,
and the SQL Server relational engine runs on the ES7000. Each source server is
connected to the destination using two 1 Gb Ethernet connections in order
to provide sufficient bandwidth for this scenario. The network connections were
driven to between 70% and 100% of capacity, so it was essential to create ideal
network settings to avoid overloading the CPUs with network interrupt work.
More details about the server, operating system, and network setup are in
following sections of this document.
.jpg)
Figure 4:
Physical topology
Source Data
We used the data generator from the Transaction Processing
Council’s TPC-H benchmark to generate data for this experiment. There were two
reasons for that choice:
·
The generator provided a convenient way to
generate realistic data that would be recognizable by many people.
·
Choosing this data set allowed us to make a
direct comparison to the results achieved by other vendors who have used the
same data set.
Although the TPC-H data generator was used, this is not a
TPC-H benchmark result. It was simply a convenience to use that data set. As
noted earlier, there is not a commonly accepted benchmark for ETL tools at this
time.
The TPC-H data generator is called DBGEN, and it generates
text files for loading into databases. Although SSIS can move data directly
from one database to another, and this usage pattern is encouraged because of
its efficiency, it is not uncommon in the ETL world to unload data from one
database to flat files and load those into another. In effect, flat files are
the “lowest common denominator” for data exchange. Because DBGEN generates flat
files, we decided to use them exactly as they were generated.
DBGEN provides the option to partition the data into a
number of streams that can be loaded in parallel. For example, the command
dbgen –T o –fF –q –b dists.dss
-s 1000 -C 56 -S 6
will generate data for the ORDERS and LINEITEMS tables at
the scale point 1000 (roughly 1 TB total data size). The data will be
partitioned into 56 streams, and the command will generate data for stream 6.
When all the data for stream 6 is generated, the following
files will be present:
02/14/2008 07:31 PM
444,676,242 customer.tbl.6
02/14/2008 08:12 PM
14,551,494,208 lineitem.tbl.6
02/14/2008 08:12 PM
3,235,968,117 orders.tbl.6
02/14/2008 08:17 PM
444,959,263 part.tbl.6
02/14/2008 08:17 PM
2,216,550,617 partsupp.tbl.6
02/14/2008 08:18 PM 25,768,892 supplier.tbl.6
We loaded data from 56 streams in parallel, so
correspondingly we generated 56 customer files, 56 line item files, and so on.
The data for each of the 56 streams is fairly similar in size, and collectively
the set adds up to 1.18 TB of source data. Later we will describe how the
files were laid out on disk to achieve the needed performance. The files are
plain text files with variable length fields and vertical bars (“|”) as the
field separators. As an example, here are the first three lines of the
part.tbl.6 file (truncated for readability):
17857141|PROMO BURNISHED
NICKEL|17|Brand#54|cream peru …
17857142|LARGE BRUSHED
TIN|47|Brand#51|violet drab …
17857143|MEDIUM POLISHED
BRASS|3|Brand#32|orange yellow …
At run time, SSIS will read the files in this format as the
first step in the preparing, transferring, and loading the data.
Database Setup
In many ways the configuration of the SQL Server database
used default settings or common configuration practices. We will review the
details, but the two unusual settings were the use of software non-uniform
memory access (soft-NUMA) to distribute work evenly among the processors and
the use of the –x startup flag for SQL Server. Soft-NUMA will be discussed in
more detail below. The –x option turns off the collection of run-time
performance statistics that appear in performance counters and DMVs. There was
one particular counter that was impacting this effort, and in subsequent builds
a fix for that problem has been introduced. This fix is in the released product
code. It is no longer necessary to run with –x to obtain performance similar to
what we achieved.
File Groups and Tables to Load
The database was created on 16 data volumes (logical disk
drives) plus a log volume. All volumes were provided by the EMC SAN. A separate
file group was created for each incoming stream of data, one file per file
group, with the files placed on the data volumes in a round-robin assignment
[ref01]. Here is how the database was created.
CREATE DATABASE TPCHdestination ON
PRIMARY
( NAME = N'TPCHdata0',
FILENAME = N'C:\Mount\Drive1\SQLdata\TPCHdata0.mdf' ,
SIZE = 1GB , MAXSIZE = UNLIMITED, FILEGROWTH = 10% ),
FILEGROUP FG1
( NAME = N'TPCHdataG1F1',
FILENAME = N'C:\Mount\Drive1\SQLdata\TPCHdataG1F1.mdf' ,
SIZE = 24GB , MAXSIZE = 24GB ),
FILEGROUP FG2
( NAME = N'TPCHdataG2F1',
FILENAME = N'C:\Mount\Drive2\SQLdata\TPCHdataG2F1.mdf' ,
SIZE = 24GB , MAXSIZE = 24GB ),
. . .
LOG ON
( NAME = N'TPCHdata_log',
FILENAME = N'C:\Mount\Log\SQLlog\TPCHdata_log.ldf' ,
SIZE = 25GB , MAXSIZE = 25GB )
GO
ALTER DATABASE TPCHdestination SET RECOVERY SIMPLE
GO
sp_dboption 'TPCHdestination','auto create statistics','OFF' ;
go
sp_dboption 'TPCHdestination','auto update statistics','OFF' ;
go
alter database TPCHdestination set PAGE_VERIFY NONE ;
go
With the file groups created, tables to be loaded with data were placed in the file groups, for example,
create table ORDERS_6
(O_ORDERDATE smalldatetimenot null,
O_ORDERKEY bigint not null,
O_CUSTKEY int not null,
O_ORDERPRIORITY char(15) not null,
O_SHIPPRIORITY int not null,
O_CLERK char(15) not null,
O_ORDERSTATUS char(1) not null,
O_TOTALPRICEmoney not null,
O_COMMENT varchar(79) not null)
on FG6
Remember that each of these tables is really a temporary
location that SSIS can rapidly load data into. It will then be switched in to
partitions of the full table.
Tables are all created without indexes. This is typical for
environments where large amounts of data are loaded – it is often better to
load the data first, and then add indexes. That’s how this test was run. We
assume that adding indexes is a separate operation that follows the data load.
Indexing is not included in the load time.
Partitioning
Data is loaded into separate temporary tables because it is
the fastest way to load data. Later in this paper, we explain that loading into
a single heap is almost as fast, but simpler to set up and manage. We expect
most sites would find the simpler design sufficiently fast.
After the data is loaded into the temporary tables, they are
then switched in to become partitions in the full portioned tables. This switch
is a metadata operation only – the data does not need to be copied or
physically moved to accomplish it. This is a significant piece of the power of
partitions, and this technique has been a recommended data loading practice
since SQL Server 2005.
Every partitioned table uses a partition function and a
partition scheme [ref02]. In short, the partition function gives the boundary
points for the partitions in the table, and the partition scheme tells what
file group each partition will reside on. In a many implementations, partitions
will be divided along time boundaries, such as creating a new partition each
week for new data. In this experiment the partitioning of streams generated by
DBGEN was simply by the primary key of each table. While this is not an entirely
typical implementation, it does show how partitioning works. The partitioning
method should always be driven by the data needs of the application.
Below are the definitions of the partition function, the
partition scheme, and the table for the Orders table. For the Orders table,
partitioning is by O_ORDERKEY, so the values in the partition function are
O_ORDERKEY values. Note that the definition of the Orders table here is
identical to the temporary table (shown above) that data is initially loaded
into. This is a requirement for being able to switch the data into a partition.
CREATE PARTITION FUNCTION pfnORDER (bigint) AS RANGE LEFT FOR VALUES (
107142850,214285700,321428550,428571424,535714274,642857124,
749999974,857142848,964285698,1071428548,1178571398,1285714272,
1392857122,1499999972,1607142822,1714285696,1821428546,1928571396,
2035714246,2142857120,2249999970,2357142820,2464285670,2571428544,
2678571394,2785714244,2892857094,2999999968,3107142818,3214285668,
3321428518,3428571392,3535714242,3642857092,3749999942,3857142816,
3964285666,4071428516,4178571366,4285714240,4392857090,4499999940,
4607142790,4714285664,4821428514,4928571364,5035714214,5142857088,
5249999938,5357142788,5464285638,5571428512,5678571362,5785714212,
5892857062) -- maximum is 6000000000
CREATE PARTITION SCHEME pscORDER AS PARTITION pfnORDER TO (
FG1, FG2, FG3, FG4, FG5, FG6, FG7, FG8, FG9, FG10, FG11, FG12,
FG13, FG14, FG15, FG16, FG17, FG18, FG19, FG20, FG21, FG22, FG23,
FG24, FG25, FG26, FG27, FG28, FG29, FG30, FG31, FG32, FG33, FG34,
FG35, FG36, FG37, FG38, FG39, FG40, FG41, FG42, FG43, FG44, FG45,
FG46, FG47, FG48, FG49, FG50, FG51, FG52, FG53, FG54, FG55, FG56)
GO
create table ORDERS
(O_ORDERDATE smalldatetimenot null,
O_ORDERKEY bigint not null,
O_CUSTKEY int not null,
O_ORDERPRIORITY char(15) not null,
O_SHIPPRIORITY int not null,
O_CLERK char(15) not null,
O_ORDERSTATUS char(1) not null,
O_TOTALPRICEmoney not null,
O_COMMENT varchar(79) not null)
on pscORDER(O_ORDERKEY)
The final requirement for partition switching is that there
must be constraints on the temporary tables so that only qualifying data can be
in the tables. There must be range checks on the tables so the data will not
violate the ranges given in the partition function for the partitioned table.
Here is an example of the range constraint placed on the Orders_6 temporary
table. Again, the actual key values given are driven by the input data:
ALTER TABLE ORDERS_6 WITH CHECK ADD CONSTRAINT check_ORDERS_6
CHECK (O_ORDERKEY >= 535714275 AND O_ORDERKEY <= 642857124)
With the tables set up as described, after SSIS has loaded
data into the temporary tables, executing the switch is as simple as this:
ALTER TABLE ORDERS_6 SWITCH TO ORDERS PARTITION 6
When all of the switches are done, there will be a uniform
Orders table with all of the data. The same process is followed for each of the
major tables.
Soft-NUMA and Port Mapping
Soft-NUMA provides a way of subdividing the processors in a
server into smaller logical groups. The SQL Server scheduler is aware of these
groups, so work stays more localized than the system hardware might otherwise
require. After logical nodes are defined, port mapping can be used to direct
work to a certain node based on the port number used in the TCP connection.
More information about soft-NUMA and port mapping, as well as a discussion of
hardware NUMA, which we reference later in this paper, is available in SQL
Server Books Online [ref03].
For this exercise, we used soft-NUMA and port mapping to
force each incoming data stream (one from each SSIS package) to go to a
different processor. Partly this was to work around some idiosyncrasies in the
SQL Server scheduler (which we hope to change in future releases), but we
probably would have used NUMA and port mapping anyway to keep network traffic
local to each hardware NUMA node in the server. The details behind this will be
described further in the section on network setup. For now we’ll just say that
there were eight CPU cores in each hardware NUMA node. We directed all the
network DPC traffic for each NUMA node to one core from that node, and we used
the other seven cores for SQL Server processing. This kept incoming data from
the network local to each hardware NUMA node. Each SSIS package connected to
one of the available soft-NUMA nodes. Because there were eight hardware NUMA
nodes in the server and each had seven soft-NUMA nodes available to do work, we
ran a total of 56 streams concurrently.
We did not need to take steps to ensure locality of disk I/O
traffic to disk controllers on hardware NUMA nodes. That is another
optimization that could be considered.
The creation of soft-NUMA nodes and port mappings is done in
the system registry. The registry settings used were [ref04]:
Windows Registry Editor
Version 5.00
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft
SQL Server\100\NodeConfiguration]
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft
SQL Server\100\NodeConfiguration\Node0]
"CpuMask"=hex:01
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft
SQL Server\100\NodeConfiguration\Node1]
"CPUMask"=hex:02
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft
SQL Server\100\NodeConfiguration\Node2]
"CPUMask"=hex:04
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft
SQL Server\100\NodeConfiguration\Node4]
"CPUMask"=hex:10
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft
SQL Server\100\NodeConfiguration\Node3]
"CPUMask"=hex:08
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft
SQL Server\100\NodeConfiguration\Node5]
"CPUMask"=hex:20
. . .
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft
SQL Server\100\NodeConfiguration\Node62]
"CpuMask"=hex:00,00,00,00,00,00,00,40
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft
SQL Server\100\NodeConfiguration\Node63]
"CpuMask"=hex:00,00,00,00,00,00,00,80
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft
SQL Server\MSSQL10.STAB1300_04\
MSSQLSERVER\SuperSocketNetLib\Tcp\IPAll]
"TcpPort"="2000[0x00000001],2001[0x00000002],2002[0x00000004],2003[0x00000008],
2004[0x00000010],2005[0x00000020],2006[0x00000040],2007[0x00000080],2008[0x00000100],
2009[0x00000200],2010[0x00000400],2011[0x00000800],2012[0x00001000],2013[0x00002000],
2014[0x00004000],2015[0x00008000],2016[0x00010000],2017[0x00020000],2018[0x00040000],
2019[0x00080000],2020[0x00100000],2021[0x00200000],2022[0x00400000],2023[0x00800000],
2024[0x01000000],2025[0x02000000],2026[0x04000000],2027[0x08000000],2028[0x10000000],
2029[0x20000000],2030[0x40000000],2031[0x80000000],2032[0x100000000],2033[0x200000000],
2034[0x400000000],2035[0x800000000],2036[0x1000000000],2037[0x2000000000],2038[0x4000000000],
2039[0x8000000000],2040[0x10000000000],2041[0x20000000000],2042[0x40000000000],
2043[0x80000000000],2044[0x100000000000],2045[0x200000000000],2046[0x400000000000],
2047[0x800000000000],2048[0x1000000000000],2049[0x2000000000000],2050[0x4000000000000],
2051[0x8000000000000],2052[0x10000000000000],2053[0x20000000000000],2054[0x40000000000000],2055[0x80000000000000],2056[0x100000000000000],2057[0x0200000000000000],
2058[0x400000000000000],2059[0x800000000000000],2060[0x1000000000000000],
2061[0x2000000000000000],2062[0x4000000000000000],2063[0x8000000000000000]"
"TcpDynamicPorts"=""
"DisplayName"="Any
IP Address"
Other Database Settings and Options
We set the network packet size to 32k instead of the default
4k. This was implemented on both the database side using the option network packet size (B) and in the SSIS
package as discussed in the next section.
SSIS Package Design
The design overview section and Figure 3 above gave the
high-level overview of the SSIS package design. The package is executed in 56
parallel instances. Each instance of the package loads one Customer file into
one Customer table, one Supplier file into one Supplier table, and so on, in
sequence. The sequencing is determined by the task dependencies set up in the
control flow for the package. Each of the tasks is one data flow, with a single
flat file source and an OLE DB destination.
One important consideration in an SSIS data flow is the
choice of data types to be used. In general it’s a good idea to get the types
converted in the SSIS pipeline to data types that will be native for the
destination database. That way the data does not have to be converted again by
the database. When they are read from text files, the data types are specified
in the Flat File Connection Manager editor. Figure 5 shows an example. It shows
that the column O_ORDERKEY, when read from an orders.tbl.n file, should be
converted to an eight-byte signed integer, which is the same as bigint in Transact-SQL. Each input
column can be mapped appropriately in the Connection Manager editor.
.jpg)
Figure 5: Setting
data types in the Connection Manager editor
In addition, there is an option to speed up parsing of
integer, date, and time types if the conversion does not have to be
locale-sensitive. This option is set on a per-column basis using the Advanced
Editor for the flat file source, as shown in Figure 6.
.jpg)
Figure 6: Setting
FastParse in the Advanced Editor
The destination chosen is the OLE DB destination, because
the SSIS package is running on a different server than the destination
database. Had this been a local database, we could have gotten further
optimization by using the SQL Server destination.
Setting up the outputs is a little simpler than the
text-file inputs, because the data types are already defined by the SSIS
pipeline and by the destination table definition. Because we have taken care to
select the SSIS data types carefully, they will match the database data types.
What remains is to select the options for the destination. Note in Figure 7
that we used the fast load functionality. We now have the option to place a
table lock, to turn off checking constraints, and to set the batch size.
Although we ran with a batch size of 100,000 rows for this project, that may
have more than needed – we saw no significant difference for this data set when
running with 1,000 or 10,000 row batches. We found that is was best to leave
the commit size at the default setting.
Also note in Figure 7 that the name of the table to load is
obtained at run time from a variable. This allows each package instance to
write to a different table.
.jpg)
Figure 7: The OLE
DB destination
With the package defined, it needs to be called at execution
time. This is done using the DTExec command prompt utility, as shown below.
White space has been added for clarity. Naturally this command line was
generated in a loop that started all 56 streams; this represents stream 6. Each
of the file connection managers has the file name set, and each of the
variables that names a destination table is set accordingly. This type of
coordination could also have been accomplished inside SSIS – we could have
passed in a single variable with the stream number, and SSIS could have
determined the file names and table names. That would be a reasonable option
for many sites.
DTExec.exe
/Conn DestinationDB;
"Data
Source=10.1.1.2,2006; Initial Catalog=TPCHdestination;
Provider=SQLNCLI10.1;
Integrated Security=SSPI;
PacketSize=32767;AutoTranslate=False;"
/Conn
Customer;"C:\Mount\Source6\TPCH1000GBby56\customer.tbl.6"
/Conn
LineItem;"C:\Mount\Source6\TPCH1000GBby56\lineitem.tbl.6"
/Conn
Orders;"C:\Mount\Source6\TPCH1000GBby56\orders.tbl.6"
/Conn
Part;"C:\Mount\Source6\TPCH1000GBby56\part.tbl.6"
/Conn
PartSupp;"C:\Mount\Source6\TPCH1000GBby56\partsupp.tbl.6"
/Conn
Supplier;"C:\Mount\Source6\TPCH1000GBby56\supplier.tbl.6"
/set
\Package\Customer.Variables[dynCustomerTabName].Value;CUSTOMER_6
/set
\Package\Supplier.Variables[dynSupplierTabName].Value;SUPPLIER_6
/set
\Package\Part.Variables[dynPartTabName].Value;PART_6
/set \Package\PartSupp.Variables[dynPartSuppTabName].Value;PARTSUPP_6
/set
\Package\Orders.Variables[dynOrdersTabName].Value;ORDERS_6
/set
\Package\LineItem.Variables[dynLineItemTabName].Value;LINEITEM_6
/F
“c:\etlwr\ops\src\WRpackages\WRproject\WRproject\SOLEDBstream.dtsx"
Two other items to note in the DTExec command are the Data
Source and the PacketSize values. The Data Source value is the IP address and
port number for the soft-NUMA node this package will connect to. In most sites
the Data Source value will simply be a machine name; sometimes it will be
accompanied by a SQL Server instance name. Here we bypassed the usual naming
and went directly to the IP address and port number. The PacketSize value was
set to 32K instead of the default 4K to reduce the number of network
round-trips.
System Setup
Disk Setup on the Database Server
The database server was attached to a SAN with a single rack
of disks in a mirrored RAID configuration. In this way we created a
customer-realistic storage system using the performance of current common
technology, including the ES7000 Server, EMC SAN technology, Fibre Channel
connections, and standard hard drives. The SAN was configured to expose 16 data
volumes and a log volume, which were formatted from Windows® and used by SQL
Server.
To host the SQL Server database files, Unisys supplied the
EMC SAN, a 6 by 4 Gbps Fibre Channel-connected EMC CLARiiON CX3-80 with a
single rack (11 trays by 15 disks) of 146 GB drives spinning at 15k RPM.
The SAN was configured to expose 17 LUNs, each of which mapped back to 8
drives, arranged in a 4x2 RAID 10 configuration. In this way, on each LUN, the
database files are striped across four sets of mirrored drive pairs. This
configuration is fault tolerant and leaves one hot spare per tray. At the stripe
size of 64k, across four mirrored drive pairs, the LUN should nicely handle
writes of 256k each. The lab tested and confirmed a maximal write rate at an
I/O size of 256k per write operation.
Each LUN was formatted as a logical volume and assigned to a
Windows NTFS mount point on the server's file system. This task could have been
performed in the Windows Server® 2008 Management Console Disk Management
snap-in; we used DISKPART for convenience in scripting the configuration tasks.
Note that in Windows Server 2003 you can use DISKPART to set the correct
partition alignment on disk by specifying the partition offset; Windows Server
2008 does this automatically.
Testing with the sqlio.exe tool confirmed a write rate just
over 800 MB/sec (parallel threads writing 256 KB buffers to all
LUNs).
The 16 data volumes mentioned in the Database Setup section
were mapped to the first 16 mount points. The 17th volume was used for the SQL
Server database’s log file. During the test we measured an average write rate
of 608 MB/sec with a peak of 872 MB/sec from SQL Server as shown in
Figure 8. Note how the rate changes as the SSIS servers load different TPC-H
tables at various times in the execution.
.jpg)
Figure 8:
Database server write rate during a test run
All data and log files were precreated to eliminate file
growth during the ETL process. SQL Server Instant File Initialization [ref05]
was configured for creation of the database files; the time to set up the
database was essentially the time to zero the log file. We found a 50 GB
log file sufficient, and most likely it could have been a good bit smaller.
NUMA on the Database Server
The ES7000 is a non-uniform memory access (NUMA)
architecture machine. We looked at two NUMA memory options: fully interleaved,
and stacked cell pair. We used four pairs in total.
The ES7000 was configured as a single 32 socket server from
eight four-CPU nodes, in a single rack. Each node was paired with a single
neighbor via a rigid side-panel interconnect, which is an electrical connection
linking the system bus of each machine to its twin. Such a pair of nodes
comprises a super cell, and in this case, the memory of each node is
interleaved to create a single memory system local to the super cell. Four
super cells were linked via a network of external so-called crossbar cables. In
this configuration we had the option of "stacking" the memory, where
the memory within a super cell remains local to that super cell and the effects
of NUMA are maximized, or of interleaving the memory across all super cells. We
would expect the latter approach to reduce both the pros and cons of a NUMA
design machine.
Other memory configurations may have been physically
possible but were not tested.
We tested against the two memory configurations described
above. Performance of the test was slightly better in the fully interleaved
configuration. Our working assumption is that the performance difference was a
result of poor locality of the network receive buffers with respect to the
network interface cards. In other words, the memory configuration that
maximized NUMA effects did not perform as well as the memory configuration that
minimized them. This can indicate a problem with network receive buffer
placement within the memory space of the node that physically hosts the NIC
using each buffer. It can also indicate that the network receive buffer is
missing. This could be an area for further research; we did not pursue it
because the performance difference is small, and is likely to vary on a newer
server with updated ACPI support. Information on the ACPI standard is available
online [ref06].
The difference in the performance of the test when comparing
the two NUMA memory configurations can be observed in the average write rate on
the SQL Server (note the scale is for the average across the physical disk
instance counters), where in both instances the test is CPU-bound on the
server. In follow-up testing, we observed a rate of 38.6 MB/sec with
stacked memory and a rate of 39.7 MB/sec with interleaved memory, a difference
of 2.9%, in line with the observed difference in test latency of 2.6%.
Disk Setup on SSIS Servers
The disk configuration on the SSIS servers was
straightforward and uncomplicated. We used four commodity class servers to
source the ETL data and execute the SSIS packages, and two storage arrays to
physically hold the 1.18 TB of flat file source data. In this way we modeled
the scenario in which multiple data sources, or ETL “source” servers, write to
a target warehouse simultaneously. The data processed by each SSIS server was
unique; there was no replicated or shared source data.
Four Unisys ES3220Ls (dual socket, quad core) were connected
to two EMC CLARiiON CX600s with two, 2Gbit FC connections from each server to a
CX600. Each CX600 held 45 spindles, for a total of 90 spindles serving the flat
file source data.
Each of the four SSIS servers drove seven streams of data
from each of two source volumes, for a total of 56 parallel streams of data
being processed and sent to the database server during the ETL process. Each
source volume was hosted on a LUN exposed to the source server. Each LUN was
formatted and assigned to a Windows NTFS mount point on the server's file
system using the DISKPART utility. This configuration supplied the required
read rate of 164 MB/sec to read 1.18 TB of flat file data in 1,800
seconds.
It would have been possible, but was not necessary, to
configure the SAN to expose a higher number of LUNs, with a lesser number of
spindles each, to optimize for sequential read. In our test configuration each
source server pulled seven streams of flat file data from each of two LUNs.
The TPC-H source data files were generated with the DBGEN
utility from the benchmark kit. We initially used a parallel process to write
most of the source files out in parallel. This strategy created extremely high
(unrealistically so) levels of fragmentation in the file system. By generating
the data into spare storage before moving them onto the source volumes, we
realized a small gain in read latency.
Other Settings
Few adjustments were made to system settings in reaching the
goal of a 30-minute 1.18 TB ETL load with SQL Server and Integration
Services.
On the database server, we enabled the available CPU cache
performance options: hardware prefetcher, and adjacent cache line prefetch
[ref07].
A hardware cache prefetcher is a system by which a CPU's
memory controller detects a sequential pattern of memory access, and when such
a pattern is detected, predicts the next piece of memory that will be requested
and places it in the cache. If the controller's prediction is correct, memory
access become significantly faster for the rest of the reads in the sequence,
because the CPU does not have to wait for a slow round trip to main memory and
back (or even worse, to a remote NUMA node).
Adjacent cache line prefetch simply means the memory
controller grabs twice as much memory at a time, or two lines of cache, instead
of the usual one line. In current systems a line of cache (or, cache line) is
often 64 or 128 bytes. Cache line size is a fixed attribute of the
hardware.
As noted earlier, we enabled the fully interleaved mode for
memory access. With respect to software, we enabled large code pages for the
SQL Server program image[ref08], [ref09]; please note that this is not required
even on a 64-bit version of Windows, and it should only be enabled in
production when recommended on a specific case basis by Microsoft.
On the source servers, we gave the SSIS process (DTEXEC) a
slight boost in priority class. In similar tests, we have observed that such a
boost evens out the processing rate of the data streams, which reduces overall
processing time by reducing the variability in data stream processing latency.
Because the latency of the test is defined as the time from the start of processing
to the completion of the last stream, reducing the variability in individual
stream latencies reduces the test latency. In other words, the wait time for
the final streams to complete is reduced. This effect was only observed when
the SSIS Server was CPU bound.
Networking Details
Operating System Selection
At the start of this project, we debated various system
architecture choices. Unlike Windows Server 2008 R2, Windows Server 2008 is
limited to 64 logical processors, and some early experiments suggested we
couldn’t accomplish our goals if we restricted ourselves to running everything
on a single system. Because this was our first time running a major ETL
benchmark workload, we made major choices early on that favored control and
scale-out flexibility over ease of use. We will likely pursue a different
(simpler) approach the next time.
Being able to scale out the front end of the system meant
running SSIS on different servers than the system that would host the
relational database, so we committed early on to loading over a network.
Because Windows Server 2008 would be launched at the same time as SQL Server
2008, and the new operating system included an all-new, more modern TCP/IP
stack, we felt this design choice would also give us an opportunity to kick the
tires on the new operating system’s next-generation networking.
Network Interface Card (NIC) Selection
The Unisys ES7000 server model we used was an older mature
cache-coherent NUMA machine [ref10], since replaced by Unisys by a newer model
with a faster memory interconnect. We quickly rejected the idea of using NIC
teaming, for fear of not being able to control NUMA placement if we hit
chokepoints in the server’s interconnect. This older server system did not have
the range of qualified NIC choices available in newer machines today, so we
steered clear of using 10 Gigabit Ethernet NICs for this first go at setting a
new ETL world record.
The NICs on the Unisys server were the built-in, Intel
PRO/1000 MT LAN-on-Motherboard (LOM) controllers. We used what was then the
latest publicly-available version of Intel’s driver for this Ethernet
controller. On the client side, we used Intel PRO/1000 PT NICs. These worked
really well, both in terms of performance and rock-solid stability.
Parallelism in the Network Topology
We settled on using four front-end multicore “client”
servers, to run SSIS. These systems would read the flat file data from
RAID-backed file systems, do needed type transformations, and then load the
data over the network into SQL Server on the 64P ES7000. The networking
architecture we chose to connect the clients and the server was to use a
dedicated 1 GbE link for each Unisys NUMA node, eight 1 GbE links in
total. To be able to track and manage things tightly, we established a clear mapping
between the physical Ethernet connections and the Layer 3 IP addresses on each
end, like this.
.jpg)
Figure 9: Logical
diagram of the networking architecture
A Gigabit Ethernet link connected each of the ccNUMA server
modules of the Unisys ES7000 with one of four clients, so each client had two
dedicated 1 Gb/s connections to the server. This arrangement eases
performance investigation work, and is not necessarily something we’re
suggesting for normal production deployments.
Extending Parallelism up the Stack with IntPolicy
We chose to extend our parallel, partitioned network
architecture up into the database server, by using a freely available Microsoft
tool, called IntPolicy, to bind NIC interrupts and DPCs [ref11] to specific
processors in the server. We wanted to be able to clearly observe CPU
utilization from network interrupts/DPCs, to be able to make adjustments during
our work. IntPolicy is available on Microsoft’s download center [ref12], and it
runs on Windows Server 2008 and Windows Server 2008 R2.
.jpg)
Figure 10:
Setting processor affinity using IntPolicy
This structured configuration was useful for controlling
NUMA locality but also allowed us to easily map performance issues end-to-end
through our client/server architecture.
The only drawback to using IntPolicy is that it’s a bit of a
pain to configure. You have to sort out your mapping between physical
interface, the network connection device name, and the physical device object
(bus/device/function #s), to be able to figure out which NIC’s interrupts/DPCs
should land on the desired CPU. This is not for normal people or the faint of
heart, but it isn’t rocket science either. Using ipconfig /all, devmgmt.msc,
and IntPolicy, it’s helpful to construct a table like this to keep the mappings
straight.
Interface | Margate IP | Node | Port L/R | Device Name NetConn | PCI Bus | PCI Device | PCI Func | New CPU Range | New DPC CPU |
P1 | 10.1.1.2 | 0 | R | #4 | 10 | 1 | 1 | 0-7 | 0 |
P2 | 10.1.2.2 | 1 | R | #12 | 67 | 1 | 1 | 8-15 | 8 |
P3 | 10.1.3.2 | 2 | R | #16 | 95 | 1 | 1 | 16-23 | 16 |
P4 | 10.1.4.2 | 3 | R | #10 | 123 | 1 | 1 | 24-31 | 24 |
P5 | 10.1.5.2 | 4 | R | #2 | 151 | 1 | 1 | 32-39 | 32 |
P6 | 10.1.6.2 | 5 | R | #14 | 179 | 1 | 1 | 40-47 | 40 |
P7 | 10.1.7.2 | 6 | R | #8 | 207 | 1 | 1 | 48-55 | 48 |
P8 | 10.1.8.2 | 7 | R | #6 | 235 | 1 | 1 | 56-63 | 56 |
Table 1: An
example of tracking the mapping between physical network interfaces and the
ES7000 CPU that would receive network interrupts/DPCs
Initial Tuning of the Network Configuration with NTttcp
After we had all of this cabled up and configured, we
decided to pre-flight the ETL experiments by validating the performance of the
network, using an internal tool called NTttcp [ref13]. Based on the same
concepts introduced in Mike Muuss’ original ttcp tool [ref14], NTttcp provides
some additional control features that take advantage of Windows kernel
features. We used NTttcp to pre-flight the basic performance of our network
before beginning runs of the ETLWR workload using SSIS and SQL Server.
A single-stream experiment with NTttcp showed some value in
running with L2 jumbo frames (9014-byte Ethernet frames), so we set that
through the Intel driver’s advanced configuration settings. Although we ended
up running the world record this way, later networking discoveries showed us we
probably could have stuck with the standard default setting (1500 bytes). To
keep performance complications to a minimum, we disabled NetBT (NETBIOS
support) and Windows Firewall on the private benchmark network interfaces.
In running NTttcp, we typically used the –v, –a, –fr, and –m
options on the receive side, to get: (1) verbose output, (2) async socket I/O
with a number of pending requests, (3) full buffer (not partial) receives, and
(4) to specify the number of threads and on which CPUs those threads should
execute the ttcp requests. For example:
ntttcpr.exe -v -a 6 -fr -m
2,1,10.1.1.2 2,9,10.1.2.2 2,17,10.1.3.2 2,26,10.1.4.2\
2,33,10.1.5.2 2,40,10.1.6.2 2,48,10.1.7.2 2,56,10.1.8.2
establishes two receive threads per CPU, on CPUs 1, 9, 17,
26, 33, 40, 48, and 56, and allows up to six outstanding async I/O requests per
thread. And on the send side, we used similar options, but obviously could skip
full receives:
ntttcps.exe -v -a 6 -m
2,1,10.1.1.2 2,5,10.1.2.2 2,9,10.1.3.2 2,13,10.1.4.2
In no time, we were running at line rate with a single
stream, so the next step was to run eight streams in parallel, one for each GbE
NIC pair between the systems. Right away we hit a problem. We could run with
two streams and get line rate, but as soon as we added a third, the other
streams would be impaired, dropping to a lower rate of throughput. As we
drilled into this issue, we found a fairly knotty, related set of issues that
involved the version of ACPI that this older server supported, the way NDIS
allocates NIC receive buffers, and NUMA fabric behavior. In the end, we cut
this Gordian knot by enabling full memory interleaving on the ES7000, and by
changing two BIOS settings, affecting cache prefetch behavior (thanks to
Unisys’ Bob Murphy). In the BIOS, we set Hardware Prefetcher Disable = No, and
Adjacent Cache Line Prefetch Disable = No.
With these changes, eight NTttcp streams (having 16 threads
on the receive-side), were able to sustain about 118 MB/s per 1 GbE flow,
when running all in parallel, which is very nearly full line rate.
.jpg)
Figure 11: Sample
output from an eight-stream run of NTttcp
This NTttcp CPU utilization meant we were consuming roughly
six of the eight CPUs we had dedicated for processing network traffic, which
isn’t bad at all considering the age of the Intel PRO/1000 MT LOM controllers
in the Unisys server.
Network Discoveries Running the Workload
With the lower level network configuration in place we were
ready to begin running scaled-down versions of the full application ETLWR
workload. We designed script automation and SSIS packages that could be run at
full scale, ingesting 1 TB of source data, and another version that worked
with 10% of the full source data. With these scale points and the help of a
tool called TCP Analyzer, we were able to identify and work around the final
obstacles to an under-30-minute run.
As we began experimental runs with the 10% data set, we
observed the 56 CPUs allocated to SQL Server for row processing were completely
pegged. To shift some of the CPU time away from TDS packet handling overhead,
and into row processing, we increased the TDS packet size from the default of
4k to the maximum value of 32k by reducing the number of packets exchanged
between client and server. This improved performance (that is, reduced load
time) by about 25%.
At this time we observed large variations in the networking
throughput during a test run. We had been using an SSIS package that loaded the
tables using a parallelized data flow within the package. To get clarity on the
source of the variation, we switched to running an SSIS package that loaded the
tables serially, which in turn helped us see the different run-time
characteristics of processing each of the different TPC-H tables. For instance,
during processing of one of the large, wide tables, such as LineItem,
networking bandwidth across the interfaces would drop as the server CPUs became
bottlenecked on processing the large rows. Narrower tables, like Orders, would
allow SSIS to drive all eight network interfaces at line rate.
Untangling the SSIS packages helped us really understand the
networking needs and behavior of processing each table, but we still saw gyrations
while SSIS worked on each table. At this point, we decided to pull out a new
Windows SDK tool, called TCP Analyzer, to figure out what was happening. This
application, which was written by the Windows core networking team, allows you
to enable the new TCP/IP stack’s support for IETF TCP Extended Statistics
(ESTATS) MIB, to see in real time what’s going on with a specific TCP flow. We
had been experimenting with changing the number of L2 NIC buffers allocated for
each adapter, but one look at TCP Analyzer showed we had been barking up the
wrong tree (thanks to Murari Sridharan for the TCP Analyzer tips).
.jpg)
Figure 12: TCP
Analyzer - using default TCP receive window auto-tuning
This TCP Analyzer screenshot, taken during an instance of a
10% ETLWR run, shows what was happening for a single TCP flow, between SSIS on
the sender (where TCP Analyzer is best run), and the SQL Server receiver.
Looking at the throughput history line graph you can see the bandwidth
gyrations. The real eye-opener was watching this run in real time and seeing
how much time this TCP flow spent in recovery from congestion. Fortunately, TCP
Analyzer also maintains a connection lifetime pie chart that breaks this down,
so you can see that a considerable amount of time during this 10% ETLWR run had
been spent waiting for the TCP path to become uncongested!
We had been assuming that there were no TCP-level issues,
and we had been experimenting with allocating smaller or bigger numbers of L2
NIC buffers. We were clearly barking up the wrong tree, as a set of experiments
showed. In fact, giving the NIC driver more L2 send buffers, although they used
the default TCP auto-tuning algorithm, actually made performance worse, because
the sender could even more easily overwhelm the SQL Server receiver, which
today has some limitations around the number of outstanding receive requests
that it keeps in flight. The new networking stack supports three TCP
auto-tuning levels: default, disabled (no auto-tuning of TCP’s receive window),
and restricted (which auto-tunes TCP’s receive window, but less aggressively
than the default setting).
Running a few experiments was interesting, not just for the
bandwidth and run-time results we got, but also for the stability of data
movement. Just switching TCP auto-tuning to restricted, we were able to get a
10% increase in performance. Changing the number of NIC auto-tuning buffers
made a slight difference, but the change was within our run-to-run variation,
so probably not really significant. Table 2 shows results comparing ETLWR
performance with changes to the TCP receive window auto-tuning setting, and the
number of NIC transmit buffers configured per Gigabit Ethernet adapter.
Run-type | NIC # TxBufs | TCP Rx Autotuning | Runtime (secs.) | Runtime Delta | Comments |
56 Streams 10% data | 512 | Normal | 233 | — | Baseline |
56 Streams 10% data | 128 | Normal | 215 | 7.7% | Reduce # of NIC buffers |
56 Streams 10% data | 128 | Disabled | 231 | 0.9% | Disable TCP Rx auto-tuning |
56 Streams 10% data | 128 | Restricted | 208 | 10.7% | Try Restricted TCP auto-tuning |
56 Streams 10% data | 256 | Restricted | 207 | 11.2% | Increase # of NIC buffers |
Table 2: Effect
of TCP receive window auto-tuning and number of NIC transmit buffers on ETLWR
performance
Before running each experiment, we simply used the built-in
netsh tool to change TCP receive auto-tuning, like this:
netsh int tcp set global
autotuninglevel=restricted
An even bigger win was in the improvement in network
bandwidth stability, with Restricted
TCP auto-tuning. Another TCP Analyzer screenshot tells the whole story.
.jpg)
Figure 13: TCP
Analyzer - using “restricted” TCP receive window auto-tuning
You can see that the flow’s average sending rate is higher,
but more importantly, no time is spent waiting on the sender or waiting for the
path to become uncongested, and the flow’s throughput is exceptionally stable.
At this point, we were receiver bound, which is what we expected. It’s worth
mentioning that TCP/IP’s default auto-tuning behavior could improve in later
releases of Windows, so be careful assuming too much.
As we got down to short strokes for this effort, TCP
Analyzer came in handy one more time, though the root cause of the problem
wasn’t really a networking issue. After overcoming the TCP auto-tuning problem,
we converged quickly to within striking distance of our goal, but it still
danced just out of reach because some CPUs on the database server exhibited a
“long tail” at the end of the run. (A long tail is a large number of unique
items, each in relatively small quantities.) We had dedicated some CPUs to
networking to avoid exactly this kind of an outcome, and yet here it was. We
studied the perfmon and SQL Server statistics on the database server, but could
not figure it out.
Then one evening (or early morning, more likely), we thought
to hook up TCP Analyzer to a networking flow associated with one of the
long-tail CPUs. The long tails happened consistently, and not randomly, so this
was easy to do (and also a bit fishy by itself). Once again, TCP Analyzer
showed us we were looking in the wrong place.
.jpg)
Figure 14: TCP
Analyzer - long-tail SSIS streams were sender-bound
You can see that the networking stream, associated with a
long-tail SSIS stream, was sender-limited, not receiver-bound. As soon as we
shifted our attention back to the clients, we noticed a small but significant
anomaly in the disk subsystem of one of the client machines, which had been
caused by a fragmented source file system. Fixing that problem got us across
the finish line!
Configuration Details
Database Server
Make: Unisys
Model: ES7000/one
Enterprise Server
OS: Windows
Server 2008 x64 Datacenter
CPU: 32
socket dual core Intel® Xeon 3.4 GHz (7140M)
RAM: 256 GB
HBA: 8
dual port 4Gbit FC
Database: Prerelease
build of SQL Server 2008 Enterprise (V10.0.1300.4)
Storage: EMC
Clariion CX3-80 (Qty 1)
11
trays of 15 disks; 165 spindles x 146 GB 15Krpm; 4Gbit FC
SSIS Servers
Quantity: 4
Make: Unisys
Model: ES3220L
OS: Windows
2008 x64 Enterprise
CPU: 2
socket quad core Intel Xeon processors @ 2.0GHz
RAM: 4 GB
HBA: 1
dual port 4Gbit Emulex FC
NIC: Intel
PRO1000/PT dual port
Database: Prerelease
build of SQL Server 2008 Integration Services (V10.0.1300.4)
Storage: 2x
EMC CLARiiON CX600 (ea: 45 spindles, 4 2 Gbit FC)
Conclusion
Setting the ETL World Record is simultaneously very
important and unimportant. It is important because it clearly shows that SSIS
is among the world leaders in ETL performance. In doing this exercise, we have
illustrated techniques that can be applied to achieve outstanding performance.
These techniques can be applied by customers today. We have also learned things
that can be used to build a better product. Some of these things have been
implemented; some will come in following releases.
The ETL World Record, like any benchmark, is unimportant if
the workload does not bear some resemblance to what you as a customer need to
do. ETL spans a broad spectrum of operations, many of which were not included
in this benchmark. While we followed the example that has been set in the
marketplace, we also look forward to industry-standard benchmarks that would
better represent customer needs. Microsoft has joined with other industry
leaders in the Transaction Processing Performance Council (TPC) to develop a
standard ETL benchmark. At the same time, we wish to reiterate that the techniques
described here are useful in many situations. We have demonstrated that SSIS
can run like a race horse.
Acknowledgments
The authors wish to thank our partners at Unisys for their
assistance and advice, especially Henk van der Valk, who spent many days and
nights in the lab working with us to perfect the system.
The authors also wish to thank:
- Intel, for lending the PRO/1000 PT network
interface cards that worked so well in the client systems.
- Ahmed Talat, for advice on NTttcp and IntPolicy.
- Bob Murphy, of Unisys, for his ideas on BIOS
cache configuration settings.
- Murari Sridharan, for the TCP Analyzer tips.
For more information:
http://www.microsoft.com/sqlserver/: SQL Server Web site
http://technet.microsoft.com/en-us/sqlserver/: SQL Server TechCenter
http://msdn.microsoft.com/en-us/sqlserver/: SQL Server DevCenter
Did this
paper help you? Please give us your feedback. Tell us on a scale of 1 (poor) to
5 (excellent), how would you rate this paper and why have you given it this
rating? For example:
- Are
you rating it high due to having good examples, excellent screen shots, clear
writing, or another reason?
- Are
you rating it low due to poor examples, fuzzy screen shots, or unclear writing?
This feedback
will help us improve the quality of white papers we release.
Send feedback.
References
[ref01] This
layout may have been more than needed. In later runs we repeated the result
with 8 data volumes and 16 file groups.
[ref02] To
learn about SQL Server 2008 table partitioning, see http://msdn.microsoft.com/en-us/library/ms188706.aspx
[ref03] SQL Server 2008 Books Online. Understanding Non-uniform Memory Access.http://msdn.microsoft.com/en-us/library/ms178144.aspx
[ref04] Note
that these registry entries create nodes on all 64 cores and assign a port
number to each, even though eight of the cores are reserved for network DPC
processing. We could have left those cores unmapped to ports; we simply chose
to control the selection of processors used for SQL Server processing from the
SSIS packages rather than enforcing it on the database server.
[ref05] SQL
Server 2008 Books Online. Database File
Initialization.http://msdn.microsoft.com/en-us/library/ms175935.aspx
[ref06] Intel
Corporation. Advanced Configuration &
Power Interface (ACPI).http://developer.intel.com/technology/iapc/acpi
[ref07] Intel
Corporation. Optimizing Application
Performance on Intel® Core™ Microarchitecture Using Hardware-Implemented
Prefetchers. http://software.intel.com/en-us/articles/optimizing-application-performance-on-intel-coret-microarchitecture-using-hardware-implemented-prefetchers/
[ref08] MSDN®. Enabling
Memory Support for Over 4 GB of Physical Memory.http://msdn.microsoft.com/en-us/library/ms190730.aspx
[ref09] MSDN. How to:
Enable the Lock Pages in Memory Option (Windows).http://msdn.microsoft.com/en-us/library/aa366720.aspx
[ref10] Technically abbreviated ccNUMA, but the cc prefix is
often not used, and this is what people typically mean by NUMA.
[ ref11] Deferred Procedure Calls (DPCs) are the method used
by the Windows family of server operating systems to process network traffic,
outside of the critical hardware interrupt processing window.
[ref12] Windows Hardware Developer Central. Interrupt-Affinity Policy Tool.http://www.microsoft.com/whdc/system/sysperf/intpolicy.mspx
[ref13] Windows Hardware Developer Central. How to Use NTttcp to Test Network
Performance.http://www.microsoft.com/whdc/device/network/TCP_tool.mspx
[ref14] Michael John Muuss. The Story of the TTCP Program.http://ftp.arl.mil/~mike/ttcp.html