Distribution Statistics Requirements for OLE DB Providers

Article
12/03/2008

To improve the optimization of distributed queries, SQL Server 2005 defines extensions to the OLE DB specification that OLE DB providers can use to report distribution statistics on the rowsets, or tables, that they expose. Although these extensions are defined in the SQL Server documentation, individual OLE DB provider developers must code support for the extensions in their providers if they want to make the information available to SQL Server. If a provider has code that supports the extensions, SQL Server can use the extensions to optimize the performance of distributed queries. If a provider does not support the extensions, SQL Server uses simple estimates of the distribution statistics.

Note

The Microsoft SQL Native Client OLE DB Provider and the Microsoft OLE DB Provider for Oracle support distribution statistics.

The distribution statistics extensions are built around a unit called a statistic. Each table can have zero or more statistics, and each statistic reports data for one or more columns. A statistic records the following:

The cardinality of the values, or the number of unique values, in each column covered by the statistic.
The cardinality of the concatenated values of all the columns covered by the statistic.
Optionally, a histogram reporting information about different ranges of key values in the first column covered by the statistic. The values reported can include the number of rows in each key range, the number of unique values in each key range, or the number of rows in the table whose key values are less than or equal to the highest key value in the range.

Here is an example table.

ColumnA	ColumnB
'abc'	'xyz'
'abc'	'xyz'
'def'	'xyz'
'mno'	'xyz'
'mno'	'mmm'
'tuv'	'xyz'

For a statistic that covers ColumnA and ColumnB, the cardinality of the combined values of the two columns is 5. This means there are 5 unique combinations of values for ColumnA and ColumnB because the first two rows have the same combined value ('abc' + 'xyz').

The cardinality of ColumnA alone is 4 and the cardinality of ColumnB alone is 2. A simple, four-step histogram on ColumnA could report.

Value range	Percentage of table rows in the range
'aaa' to 'hzz'	50%
'iaa' to 'nzz'	33%
'oaa' to 'rzz'	00%
'taa' to 'zzz'	17%

Different OLE DB data sources record distribution statistics on different combinations of columns, and the set of statistics reported by an OLE DB provider is implementation defined. For example, SQL Server versions 6.5 and earlier build distribution statistics only for columns covered by indexes and have one statistic for each index defined on a table. SQL Server version 7.0 and later builds the following statistics:

One statistic for each index defined on a table.
One statistic for each CREATE STATISTIC statement.
One statistic for each statistic that is automatically generated.

A column has a high degree of selectivity if it is likely to return a small number of rows for a particular value specified in a predicate argument. The distribution statistics can be used to estimate the degree of selectivity:

Columns with high cardinality have more data values, and each data value is likely to match a smaller number of rows than a column with low cardinality.
If an OLE DB provider provides a histogram reporting how the values are distributed in a column, the SQL Server optimizer can also estimate if the specific value in a predicate argument is in a range that has good or poor selectivity.

Having good distribution statistics for a linked server can also help the optimizer build an efficient execution plan for the local part of a distributed query.

The SQL Server optimizer uses the distribution statistics in an attempt to reduce the amount of data that must be communicated between the OLE DB provider and SQL Server. For example, when performing a distributed join between TableA on the local server and TableB on a linked server, SQL Server can use the distribution statistics to determine which of these processes is most efficient:

Send the rows from TableA that match nonjoin predicates to the linked server and have the linked server perform the join.
Retrieve the rows from TableB that match nonjoin predicates to the local server and perform the join on the local server.

If an OLE DB provider does not report cardinality information about a column, the SQL Server optimizer estimates a low cardinality. If a provider does not report a distribution histogram for a statistic, the optimizer operates as if the values are evenly distributed in the rows of the table.

SQL Server uses the following extensions from OLE DB providers to report distribution statistics:

A data source property, DBPROP_TABLESTATISTICS, indicates w the provider reports distribution statistics.
An IDBSchemaRowset, TABLE_STATISTICS, lists the statistics available for a specified base table. This includes column and row cardinality.
IOpenRowset::OpenRowset accepts arguments identifying a statistic. When a statistic is specified, OpenRowset returns a histogram rowset that shows the distribution of values in the first column covered by the statistic specified in StatisticID.

These extensions to OLE DB are included in OLE DB version 2.6 or later. For information about these extensions regarding distribution statistics, see the OLE DB 2.6 specification.

An OLE DB provider can choose to implement a performance enhancement of sampling only a part of the rows in a base table to determine the distribution statistics and histograms. These providers should scale their cardinality and histogram data to reflect the total values for the table before reporting them in the TABLE_STATISTICS and histogram rowsets.

Whether an OLE DB provider keeps the data in the TABLE_STATISTICS and the histogram rowset up to date with the current contents of the base table is implementation defined.

Distribution Statistics Requirements for OLE DB Providers

See Also

Concepts

Other Resources

Help and Information

Additional resources